Multimodal Pain Recognition Based on Contrastive Adversarial Autoencoder Pretraining

Nikolai A. K. Steur; Friedhelm Schwenker

doi:10.3390/make7040165

and

Institute of Neural Information Processing, Ulm University, James-Franck-Ring, 89081 Ulm, Germany

^*

Authors to whom correspondence should be addressed.

Mach. Learn. Knowl. Extr.2025, 7(4), 165;https://doi.org/10.3390/make7040165

Version Notes

Order Reprints

Abstract

Background: Automated pain assessment aims to enable objective measurement of patients’ individual pain experiences for improving health care and conserving medical staff. This is particularly important for patients with a disability to communicate caused by mental impairments, unconsciousness, or infantile restrictions. When operating in the critical domain of health care, where wrong decisions harbor the risk of reducing patients’ quality of life—or even result in life-threatening conditions—multimodal pain assessment systems are the preferred choice to facilitate robust decision-making and to maximize resilience against partial sensor outages. Methods: Hence, we propose the MultiModal Supervised Contrastive Adversarial AutoEncoder (MM-SCAAE) pretraining framework for multi-sensor information fusion. Specifically, we implement an application-specific model to accomplish the task of pain recognition using biopotentials from the publicly available heat pain database BioVid. Results: Our model reaches new state-of-the-art performance for multimodal classification regarding all pain recognition tasks of ‘no pain’ versus ‘pain intensity’. For the most relevant task of ‘no pain’ versus ‘highest pain’, we achieve

84.22 %

accuracy (

F_{1}

-score:

83.72 %

), which can be boosted in practice to an accuracy of ≈

95 %

through grouped-prediction estimates. Conclusions: The generic MM-SCAAE framework offers promising perspectives for multimodal representation learning.

Keywords:

adversarial regularization; BioVid dataset; contrastive learning; information fusion; late fusion transformer; multimodal representation learning; ordinality-aware classification; pain recognition; variational autoencoder

1. Introduction

Nowadays, automated pain assessment is attracting increasing interest in the domain of medical and health care. This trend is driven by the desire to transform the inherent subjectivity of individual pain perception into objective quantification for optimizing patients’ treatment and to preserve human resources in clinical environments. A patient’s self-reporting about pain episodes and intensities that they experience currently serves as the de facto standard for pain measurement [1,2,3] and engenders direct consequences on the quality of medical treatment (cf. [1,3]), due to the role of pain as a strong indicator of harmful physical conditions, as well as an alarm signal for certain diseases. In cases of missing communicative abilities originating from mental impairments, loss of consciousness, or infantile restrictions, respective patients may be exposed to inappropriate drug dosing or even suffer from false diagnoses (cf. [2,3]).

These reasons motivate an ongoing search for meaningful physical signals to implicitly detect and quantify pain based on patients’ immediate pain reaction. Previous research attempts investigated facial expressions via video recordings [3,4], minimally or noninvasive measurements of biopotentials (i.e., heart rate, muscle activity and skin transpiration) [3,5,6,7,8], or combined all distinct modalities together in order to create automated pain assessment systems [3,9,10]. At first, such systems mostly relied on the engineering of handcrafted features [3,5,9,10,11], but with the rise of artificial neural networks as general-purpose technology, recent approaches [4,6,7,12,13,14] prefer end-to-end deep learning models concerning both feature extraction and pain classification.

Apart from the technological evolution of pain classification models, the focus has shifted from multimodal approaches [3,5,6,7,9,10,11,12] to highly specialized unimodal methods [13,14]. This development can be explained by the strong differences in the contribution per modality for solving the task of pain classification. For instance, there exists empirical evidence that Electrodermal Activity (EDA) is by far the most significant signal for pain classification [5,6,9,10,11] when compared to the other modalities of Electrocardiogram (ECG), Electromyography (EMG) and the recorded facial expressions provided by the prominent heat pain database BioVid (Part A) [1,2,3]. In particular, the modality of capturing facial expressions via video recordings probably lost attention because of its dependence on being awake, its costly clinical setup (cf. [11]) and the risk of intentional affectation.

Although the EDA signal quantifying the change in skin transpiration currently offers the highest potential for automated pain assessment, relying on a single sensor to implicitly measure pain is not reliable for decision-making in a sensitive domain such as health care. This argument is supported by the laboratory settings in which recent pain classification datasets were collected, where environmental factors are standardized and probands are in normal condition. But in a more realistic case, when patients suffer from diseases, receive medication and are exposed to external factors, pain-related body reactions may drastically change.

To address this problem, we propose the novel MultiModal Supervised Contrastive Adversarial AutoEncoder (MM-SCAAE) pretraining framework for multi-sensor information fusion and provide an application-specific implementation for the task of pain classification. We prove the performance of our concept using the publicly available BioVid dataset, which particularly ensures a reasonable comparison to existing methods. Due to the sustained difficulty of fine-grained pain classification, we continue working on the task of pain recognition between the no-pain baseline and individual pain intensities, as is common in the literature. Our model accesses the three biopotentials EDA, ECG and EMG (at the trapezius muscle) to accomplish the task of multimodal pain recognition.

The major contributions of our work are summarized below:

We derive a novel autoencoder variant for multimodal information fusion by combining denoising variational autoencoders with adversarial regularization and a supervised contrastive loss for global representation learning. This methodology constitutes the abstract definition of a new supervised adversarial autoencoder model that is application-independent and can easily be adopted for other multi-sensor classification tasks of time-series data.
The implementation of our designed MM-SCAAE framework achieves new state-of-the-art performance for multimodal pain recognition on the BioVid dataset regarding all four binary classification tasks (i.e., ‘no pain’ vs. ‘pain intensity’). Specifically, our model reaches an accuracy of $84.22 %$ and an $F_{1}$ -score of $83.72 %$ for the most relevant task of ‘no pain’ versus ‘highest pain’ using the three biophysiological measurements EDA, ECG and EMG.
Our empirical results indicate that a grouped-prediction estimate stemming from multiple short-term observations (here: $5.5$ s) may serve as key ingredient for the practical applicability of current pain assessment methods. In particular, the performance of our model can be boosted to the maximal accuracy of ≈95% for the task of ‘no pain’ versus ‘highest pain’ by approaching the full group size of twenty samples (i.e., in total a period of 110 s) per patient.

The organization of this paper is as follows: Section 2 provides an overview of the proposed MM-SCAAE framework and explains the theoretical foundations for each model component in detail. Section 3 relates our concept to previous works from a technological and an application-specific perspective. Section 4 introduces the BioVid dataset used for pain classification, presents the preliminaries for the intended analyses and elucidates the results of the conducted experiments. Section 5 discusses the implications of our work and offers prospects of promising future research directions. Finally, Section 6 concludes with the major findings of the present paper.

2. Methods

2.1. Model Overview

Figure 1 provides an overview of the proposed MultiModal Supervised Contrastive Adversarial AutoEncoder (MM-SCAAE) pretraining model for the task of pain classification based on diverse biophysiological measurements. Essentially, the model can be divided into the three logical units of Denoising Variational AutoEncoders (DVAE)s, a late fusion network and adversarial regularization.Through the design as an end-to-end learning framework, the three units dynamically shape interdependencies for hierarchical representation learning during the course of pretraining. The first level of representation learning is embodied by an individual DVAE per input modality, which is trained on a conventional reconstruction loss. Each application-specific encoder network is composed of a Multi-Scale Convolutional Neural Network (MSCNN) [15,16] for feature extraction with varying granularity, and a consolidating Multi-Layer Perceptron (MLP) for creating a Gaussian distribution around the latent representation. The respective decoder network recovers the original sample using an MLP and transposed convolutional layers (cf. [17]). Despite the denoising and variational criteria of the autoencoder components, the auxiliary augmentation of the input data artificially increases the training corpus to effectively prevent overfitting. The encoding spaces are regularized by the alignment with a prior distribution type to support the learning of continuous latent representations. However, distribution parameters are directly estimated from the empirical latent distributions. The adversarial networks for regularization follow an identical structure of a small MLP (e.g., with two stacked layers) which incurs marginal computational costs. In the second level of the representation learning, a late fusion network aggregates the unimodal encodings into a global object representation. The aggregation mechanism recognizes task-specific relationships between the modalities with the aid of a self-attention transformer [18] and condenses all relevant information into a global object vector normalized by a squash [19] function. The latent space of global objects also aligns with a prior distribution type through adversarial regularization, and, in particular, forms class partitions according to the supervised contrastive loss. After pretraining, a simple classifier can be trained on the global object space for performing downstream tasks.

Figure 1. The general architecture of the MM-SCAAE pretraining model for fusing the information from multi-sensor input data into a global object representation. A Denoising Variational AutoEncoder (DVAE) per input modality acts as a core component (center), which fosters the learning of an expressive latent space. The latent representation spaces are regularized by separated adversarial networks and a common fusion network. The adversarial regularization (left) aligns the output space of each encoder with a shared prior distribution type but individually adapts the distribution parameters. The late fusion network (right) aggregates the unimodal encodings into a global representation space with class partitions induced by the supervised contrastive objective. The global latent space is also regularized by an adaptive prior distribution. The processing course of the three displayed input modalities is highlighted through the individual colors of blue, yellow and (light) red.

Our proposed pretraining framework is best understood as a reformulation of the well-known variational lower bound [20,21] with a modified regularization

R_{Latent}

of the latent representation space:

E_{X} [- log p_{θ} (X)] \leq E_{X} [L_{Recon .}^{'} + \underset{R_{Latent}}{\underset{︸}{(R_{Adv .} + R_{Con .}}})]

(1)

where

L_{Recon .}^{'}

,

R_{Adv .}

and

R_{Con .}

refer to the denoising reconstruction loss, the adversarial regularizations and the supervised contrastive loss, respectively. Here,

p_{θ} (X)

describes the maximum likelihood estimate over the true data-generating distribution

p_{data}

, and the expectation over the training data is expressed as

E_{X} = E_{X \sim p_{data}}

. In the following, we stepwise construct the mathematical framework of Equation (1) by elaborating each model component and elucidating its integration into the overall concept.

2.2. Variational Autoencoder

A Variational AutoEncoder (VAE) [20,21] introduces a stochastic encoder component

q_{ϕ} (Z | X)

into the classic autoencoder [22] framework for maximizing the Variational Lower Bound (VLB) in order to approach the true data distribution

p_{data} (X)

via the maximum likelihood estimate over

p_{θ} (X)

:

E_{X} [log p_{θ} (X)] \geq E_{X} E_{q_{ϕ} (Z | X)} [log \frac{p_{θ} (X, Z)}{q_{ϕ} (Z | X)}] = VLB .

(2)

Here, the model parameters comprise

θ = {ϕ, ψ}

, and are manifested within the nonlinear encoder

q_{ϕ} (Z | X)

and decoder

p_{ψ} (X | Z)

components, which are usually realized as neural networks. As a unique characteristic of the VAE, its encoder emits learned parameters for a locally centered distribution around the evidence—supplied by the input sample—from which latent representations are stochastically drawn using the reparameterization trick [20,21]. The multivariate Gaussian is typically chosen with predicted mean and log-variance vectors by the encoder. At the time of inference, expectation of the emitted distribution acts as deterministic encoder output. The VLB can be rewritten as a composite objective of the model’s reconstruction ability (

- L_{Recon .}

) and the prior distribution alignment

R_{Prior}

with the latent encoding space, i.e.,

VLB = E_{X} [\underset{- L_{Recon .}}{\underset{︸}{E_{q_{ϕ} (Z | X)} [log p_{ψ} (X | Z)]}} - \underset{R_{Prior}}{\underset{︸}{D_{KL} (q_{ϕ} (Z | X) | | q (Z))}}]

(3)

where

D_{KL} (\cdot | | \cdot)

stands for the Kullback-Leibler (KL) divergence between two regarded distributions. Proofs for Equations (2) and (3) are provided in Appendix A. Often, a concrete prior distribution

q (Z)

is chosen. However, we vote for a relaxed formulation that solely specifies the distribution type and adaptively determines the distribution parameters during model training. This proceeding is inspired by previous VAE deliberations that either refrain from a prior distribution [23] or approximate a data-dependent prior using moving averages with momentum [12,24]. An isolated consideration of the

D_{KL}

-regularizer (

= R_{Prior}

) implies a penalty on the mutual information

I (X, Z)

between the evidence X and the latent variables Z, which can be interpreted as information processing costs that can be optionally weighted with an inverse temperature [12,24,25,26]. In this context, the optimal prior is determined via marginalization

q^{*} (Z) = E_{X} [q_{ϕ} (Z | X)] = \sum_{X} q_{ϕ} (Z | X) p_{data} (X)

over the complete evidence [24,25,26,27]. Interestingly, the equally weighted composition of the reconstruction ability and the

D_{KL}

-regularizer instead maximizes the mutual information, i.e.,

VLB = I (X, Z) + const .

(see last proof in Appendix A), leading to seemingly contradictory interpretations. However,

β

-VAE [28] makes clear that the application-specific balancing between reconstruction ability and prior distribution alignment forms a key ingredient in establishing a trade-off between competing optimization goals. A natural choice for an adaptive latent distribution represents the multivariate Gaussian

N (X | 0, σ^{2} I)

with zero-mean and estimated variance

σ^{2}

according to (cf. [12]):

σ_{t + 1}^{2} = (1 - \frac{1}{τ}) σ_{t}^{2} + \frac{1}{τ} Var [q_{ϕ} (Z | X_{t})],

(4)

where

σ_{t}^{2} = Var [q_{t} (Z)]

at time step t and the hyperparameter

τ

defines the sensitivity to value changes. Each modality owns an exclusive variance estimate. Since the scaling of features immensely influences the severity of their discrimination [29], we average variance updates over all feature dimensions to prevent overfitting.

2.2.1. Bernoulli Denoising Criterion

Denoising AutoEncoders (DAE)s [30,31] extend traditional autoencoders by recovering the original state of stochastically corrupted inputs for encouraging model robustness against input noise. In the context of Denoising Variational AutoEncoders (DVAE)s, the conventional VLB needs an update in terms of noisy input consumption. According to [32], we define the noise-injected encoder component as follows:

{\tilde{q}}_{ϕ} (Z | X) = \int q_{ϕ} (Z | \tilde{X}) p_{π} (\tilde{X} | X) d \tilde{X}

(5)

by means of a predefined corruption distribution

p_{π} (\tilde{X} | X)

, which entails the Denoising Variational Lower Bound (DVLB):

E_{X} [log p_{θ} (X)] \geq E_{X} E_{{\tilde{q}}_{ϕ} (Z | X)} [log \frac{p_{θ} (X, Z)}{q_{ϕ} (Z | \tilde{X})}] = DVLB .

(6)

The denoising criterion integrates to an equivalent optimization objective formulation as in Equation (3), but shapes stochastic latent neighborhoods around original samples:

DVLB = E_{X} [\underset{- L_{Recon .}^{'}}{\underset{︸}{E_{{\tilde{q}}_{ϕ} (Z | X)} [log p_{ψ} (X | Z)]}} - \underset{R_{Prior}^{'}}{\underset{︸}{D_{KL} (q_{ϕ} (Z | \tilde{X}) | | q (Z))}}] .

(7)

As mentioned in [31], optimizing against the DVLB is equivalent to maximizing the mutual information

I (X, Z)

between the uncorrupt data X and the latent factors Z from an information-theoretic perspective. Proofs for the statements in Equations (6) and (7) and for the maximization of mutual information are given in Appendix A. It is important to note that the positive effects on model robustness resulting from denoising regularization directly depend on the application-specific sensitivity to certain noise levels [32]. Although additive Gaussian noise is usually favored as default input corruption, we declare Bernoulli-based feature selection as a universal variant of noise injection. Due to the manipulation of concrete feature values using additive Gaussian noise, we argue that an appropriate trade-off between beneficial denoising and potential signal degeneration is less intuitive to find and frequently requires specific domain knowledge. On the contrary, a Bernoulli corruption distribution

p_{π} (\tilde{X} | X; B (π))

solely influences feature existence and can be easily implemented via Dropout [33]. In particular, theoretical analysis in [32] at least indicates that introducing further stochastic layers into the encoder network can tighten the VLB.

2.2.2. Adversarial Regularizer

An Adversarial AutoEncoder (AAE) [34] replaces the

D_{KL}

-regularizer of the VAE with a competitive network to discriminate between the learned posterior and the desired prior by means of random samples drawn from both distributions. According to the conventional Generative Adversarial Network (GAN) [35] training procedure, this can be realized as alternate optimization leading to the latent regularization

R_{Adv .} = - log (4) + 2 \cdot D_{JS} (q_{ϕ} (Z | X) | | q (Z))

(8)

for the encoder network (also known as the generator), given that the discriminator is optimal. Here,

D_{JS} (\cdot | | \cdot)

corresponds to the Jensen-Shannon (JS) divergence

D_{JS} (p (X) | | q (X)) = \frac{1}{2} [D_{KL} (p (X) | | \frac{p (X) + q (X)}{2}) + D_{KL} (q (X) | | \frac{p (X) + q (X)}{2})]

, which symmetrizes the KL divergence of the distributions

p (X)

and

q (X)

with their normalized and equally weighted mixture as the reference distribution. In particular, the mixture reference distribution smooths the regularization loss for the case of heavy deviations in

q_{ϕ} (Z | X)

from small values in

q (Z)

. Both of these adjustments secure a more sophisticated distribution discrimination criterion compared to the classic

D_{KL}

-regularizer. As unique property, the adversarial regularizer potentially has access to universal function approximation capabilities for distribution discrimination. This property promotes full space exploitation of the target/prior distribution. In contrast to this, the non-parametric

D_{KL}

-regularizer could easily be fooled with the use of sample repetition or punctual space exploitation. A descriptive demonstration of this phenomenon can be found in the primary AAE paper [34]. Due to the inclusion of the previously described Bernoulli denoising criterion, our framework builds upon Denoising Adversarial AutoEncoders (DAAE)s [36] with probabilistic encoders (borrowed from VAEs) as central components. There exists empirical evidence that DAAEs can improve classification performance on downstream tasks and coherence in sample synthesis after one additional iteration [36].

2.3. Supervised Contrastive Objective

Supervised Adversarial AutoEncoders (SAAE)s [34] make use of available label information to partition latent representations in accordance with a priori class memberships. We propose a novel implementation of the SAAE framework by incorporating a Supervised Contrastive (SupCon) training objective for indirectly backpropagating class association from a common fusion network through all unimodal encoders. For this purpose, we design an adjusted similarity measure for latent representations, present the used SupCon loss and develop a loss variant for class-ordinality injection. The effectiveness of our general SAAE implementation is empirically verified with descriptive results for the MNIST [37,38] dataset, which are provided in Appendix D.

2.3.1. Similarity Measure

The

squash (x)

function [19] is regularly used in the context of Capsule Networks (CapsNet)s [19,39] as a special activation function to scale the second norm

| | \cdot | |

of a vector output

x

to be in the range of

[0, 1]

, while preserving original vector orientation. A typical use case for squash normalization is the interpretation of vector magnitude as the existence probability of an observed entity and vector elements as the entity instantiation parameters [19]. The entanglement of distributed representations (i.e., capsules [40]) with their entity existence probabilities improved model performance in previous work concerning text classification [41] and image recognition [39]. Consistently, we enhance traditional contrastive learning by involving model prediction confidence in latent representations into the similarity consideration between input samples using the squash function [19]:

\hat{x} = {squash}_{α} (x) = \frac{α + {| | x | |}^{2}}{1 + {| | x | |}^{2}} \frac{x}{| | x | |}, | | \hat{x} | | \in [α, 1],

(9)

where we include the constant

α \in [0, 1]

as a smoothing term. In all of our experiments, we limit the lower bound of vector magnitudes to

α = 0.1

for ensuring numerical stability and fostering faster optimization convergence. This proceeding constitutes a relaxation of the classic contrastive learning constraint based on unit vectors, due to the integration of sample relevance, which attenuates the overestimation of outliers. Incorporating the squash function into the inner product

⟨\cdot, \cdot⟩

leads to a confidence-aware version of the cosine-similarity

s (x, y) = \frac{⟨\hat{x}, \hat{y}⟩ + 1}{2} \in [0, 1]

(10)

between the two vectors

x

and

y

, which we additionally normalize to positive values, with one as the upper bound. A more detailed discussion about the impact of prediction confidence on the defined similarity measure is provided in Appendix B.

2.3.2. $ϵ$ –SupInfoNCE Loss

Ref. [42] presented the first SupCon loss that deviated from the self-supervised scheme of attracting augmented versions of an anchor sample and repelling all other samples, by instead accessing predefined labels for class discrimination. The

ϵ

–SupInfoNCE loss [43] can be seen as an advancement of the classical SupCon loss by eliminating the implicit constraint of collapsing same-class instances to a single point and by introducing an

ϵ

-margin hyperparameter for fine-tuning metric-based learning. The

ϵ

–SupInfoNCE loss is defined as follows [43]:

R_{Con .}^{Nom .} = - \frac{1}{| P |} \sum_{i} log (\frac{exp (s_{i}^{+})}{exp (s_{i}^{+} - ϵ) + \sum_{j} exp (s_{j}^{-})})

(11)

where

s_{i}^{+}

and

s_{j}^{-}

indicate the similarity between two latent representations from samples of the same class and from samples of distinct classes, respectively. We adopt the scaling factor

1 / | P |

from [42] to relate the SupCon loss to the number of positive samples

| P |

for each temporary anchor within a batch. Specifically, Equation (11) represents a nominal loss function with regard to multi-class problems by equally weighting distances between differing classes.

2.3.3. Ordinal $ϵ$ –SupInfoNCE Loss

An ordinal version of the

ϵ

–SupInfoNCE loss requires an application-specific alignment of the

ϵ

-margin with intrinsic class relationships. In general, this requirement can be formalized in three steps. At first, we state the similarity of a latent representation

x_{i}^{(n)}

relative to the latent representation of a selected anchor sample

x_{a}^{(p)}

as follows:

s_{i}^{(n)} = s (x_{i}^{(n)}, x_{a}^{(p)})

(12)

where the superscripts

n, p \in N

signify belonging to a certain class through its ordinal label. Here, the case

n = p

is explicitly included and would indicate a positive sample in terms of the nominal

ϵ

–SupInfoNCE loss. In the next step, we create a training objective specification between two samples from different classes with

{s_{j}^{(n)} - s_{i}^{(p)} + f_{ϵ} (r_{n p}) \leq 0 | n \neq p} \forall i, j

(13)

where the function

f_{ϵ} (r_{n p})

defines the minimal

ϵ

-margin between both regarded samples based on their absolute class label difference

r_{n p}

. The class label difference

r_{n p}

can especially be seen as relative rank distance because of its integer nature. To fulfill the ordinality constraint,

f_{ϵ} (\cdot)

must be a strictly monotonically increasing function by means of the rank-based distance between classes. Both of these properties are summarized in the following statements:

r_{n p} = | n - p |, f_{ϵ} (x) < f_{ϵ} (y) \forall x < y .

(14)

Using the optimization objective in Equation (13), we can devise an ordinality-aligned version of the original

ϵ

–SupInfoNCE loss in the form of

R_{Con .}^{Ord .} = - \frac{1}{| P |} \sum_{i} log (\frac{exp (s_{i}^{p})}{exp (s_{i}^{p}) + \sum_{j} exp (s_{j}^{n} + f_{ϵ} (r_{n p}))})

(15)

where we omit the parenthesis of the class labels in the superscript for the benefit of a compact notation. The loss derivation of Equation (15) follows the proceeding in [43] and can be seen in Appendix C.

2.4. Mutual Information Constraints

Mutual information regularizations were previously included within hierarchical models to facilitate efficient information processing with limited resources [12,23,24,25,26]. This idea is grounded in the principle of information-theoretic bounded rationality [27], which regards information processing costs as additional factor for decision-making. In view of the disentanglement of latent variables,

β

-VAE [28] also suggests the integration of a task-specific weighting on the KL divergence regularization into the classical VAE framework, but still remains with a static prior distribution. Since we prefer adaptive prior distributions in our approach, we can analogously formulate mutual information constraints for the pretraining of our multi-stage pain classification model with the aid of Lagrange multipliers

β_{i} \in R^{+}

as in [12,24,25,26]:

\begin{matrix} R_{MI}^{pre} & = \frac{1}{β_{1}} \sum_{i = 1}^{C} I (X_{i}, Z_{i}) + \frac{1}{β_{2}} I (V, {Z_{1}, . ., Z_{C}}) \end{matrix}

(16)

\begin{matrix} = \frac{1}{β_{1}} \sum_{i = 1}^{C} log \frac{q_{ϕ} (z_{i} | x_{i})}{q_{i} (z)} + \frac{1}{β_{2}} log \frac{p (v | z_{1}, . ., z_{C})}{p (v)} \end{matrix}

(17)

where

X_{i}

means the random variable of the i-th input channel,

Z_{i}

states the corresponding latent representation, and V encapsulates the global entity encoding received from the late fusion network. Although we formulate the partial goal to reduce mutual information based on the adversarial prior regularizations, the

ϵ

–SupInfoNCE loss particularly shapes a lower (

ϵ = 0

) or upper (

ϵ \to \infty

) bound on the mutual information [43,44]. These bounds can be further tightened with a proper choice of the

ϵ

-margin [43]. Our motivation behind this composite learning objective is to shift the focus from high-dimensional instance appearance (i.e., reconstruction ability) as a crucial attribute for mutual information toward latent features for class discrimination.

4. Results

4.1. BioVid Dataset

The Biopotential and Video (BioVid) Heat Pain Database (Part A) [1,2,3] is a publicly available pain classification dataset. It comprises time series of biophysiological signals combined with frontal video recordings (including depth data) of facial expressions gathered from a controlled experiment of external pain stimulation with 90 study participants. Since the data collection of 3 subjects suffers from missing values due to technical problems, we only concentrate on the remaining 87 individuals, as is common in the literature. The distribution of individuals per age group and sex is approximately balanced. Pain stimuli were induced as heat application on the skin of the right arm using four discrete heat intensities representing distinct pain levels. The lowest and the highest pain levels were determined during an initial calibration phase per patient. In this calibration phase, the temperature induced by the heat-application device was constantly increased and patients were asked to report their lower-bound pain threshold (i.e., if pain perception starts) and upper-bound pain tolerance (i.e., if perceived pain is unacceptable). The pain levels consist of the participant’s self-reported lower-bound pain threshold (

T_{1}

) and upper-bound pain tolerance (

T_{4}

) with two equally distant intermediate levels (

T_{2}

and

T_{3}

), which are proportionally mapped to the respective heat intensities. During the experiment, each subject experienced a randomized sequence of 80 heat intensities composed of 20 stimuli per pain level. After reaching the required temperature of a pain level, each heat stimulus was held for 4 s. Between successive stimuli, heat emission randomly paused for

t_{pause} \in [8, 12]

s plus the needed time for cooling down the heat module to the baseline temperature of 32 °C. In addition, baseline data without pain (

T_{0}

) were extracted from the pauses to obtain a time span with same magnitude to the 20 stimuli per pain level. Each sample captures a period of 5.5 s. The biophysiological signals encompass biopotentials measured via Electrocardiogram (ECG), Electrodermal Activity (EDA)/Galvanic Skin Response (GSR) and Electromyography (EMG) at the trapezius muscle. The temporal synchronization between the measurements was ensured with the aid of a specialized medical device for multimodal data acquisition. In this paper, we restrict our experiments to the multimodal recognition of the four ordinal pain levels in relation to the no-pain baseline based on these biophysiological signals.

4.2. Pain Recognition Task

We formalize the task of pain recognition as a supervised learning scenario of binary classification where emerging pain intensities (

T_{1}

,

T_{2}

,

T_{3}

,

T_{4}

) need to be separated from the patient’s normal state (

T_{0}

). For this purpose, a certain number C of the patient’s biopotentials are periodically monitored over a predefined time span S, resulting in a multimodal sample

X \in R^{C \times S}

. The training data of each pain recognition task comprises the normal state samples

{(X_{i}, T_{0})}_{i = 1}^{N}

and the pain episode samples

{(X_{j}, T_{k}) | k \neq 0}_{j = 1}^{N}

of pain intensity

T_{k}

. As a performance evaluation regime for each pain recognition task, the Leave-One-Subject-Out (LOSO) [4,6,7,8,11,12] cross validation is adopted. This means that for each run, the model performance is tested on a specific patient that was excluded from the training procedure. Finally, the overall model performance is determined as average over all LOSO runs. The practical relevance of the individual pain recognition tasks increases with the relevant pain intensity. Since the discrimination between neighbored pain intensities to date constitutes a notably challenging problem, it is common in the present literature to focus on the above binary classification tasks for evaluating a model’s performance.

4.3. Data Preprocessing and Augmentation

This section chronologically presents the implemented data preprocessing and augmentation pipeline for the task of pain recognition based on the multimodal biophysiological measurements from the BioVid heat pain database.

4.3.1. Instance-Based Normalization

We conduct min-max normalization channel-wise for each multimodal sample

X \in R^{T \times C}

, with C feature channels and a sequence length of T, over its temporal dimension by

{\tilde{x}}_{i, c} (t) = \frac{x_{i, c} (t) - min (x_{i, c})}{max (x_{i, c}) - min (x_{i, c})} \in [0, 1],

(18)

where

x_{i, c} \in R^{T}

refers to the complete time series of the c-th channel from the i-th sample, and

x_{i, c} (t)

denotes the single feature at time step t. Hence, we obtain a reasonable feature range per sample without harming the time-dependent structure. This sample-wise normalization strategy is inspired by Layer Normalization (LN) [58], which is regularly applied on layer-wise neural activities in conjunction with sequential data. Contrary to LN, we refrain from zero-mean and unit-variance normalization and omit the final reparameterization step of explicit signal re-scaling and re-shifting. Apart from this single instance-based normalization step applied to all modalities, no additional preprocessing is conducted. In the following, the data augmentation pipeline for the training process is presented.

4.3.2. Random Crop

As a straightforward data augmentation technique, we apply random cropping to each input sequence. When the i-th sample

X_{i} \in R^{T \times C}

with length T and C channels is drawn from the training set, a random window with fixed length t is selected to constitute a multi-channel augmented sequence

{\tilde{X}}_{i} = {(X_{i})}_{s \leq j \leq s + t}, s \sim U (0, T - t),

(19)

where j states the selected time frame, and s describes the shift from the start of the input sequence. The window shift s is sampled from the uniform distribution bounded by the minimal value of zero and the maximal value of

T - t

. Specifically, we clip windows of length

t = 2688

from input sequences of original length

T = 2816

. Taking into account the sample rate of 512 per second within the recorded signals from BioVid, we allow for maximal shifts of 250 ms. At the time of model inference, the window selection is deterministic, setting s to the rightmost position, i.e.,

s = T - t

. The rationale behind this proceeding derives from the expectation of occurring latency in biophysiological reactions on pain events. Temporal multi-channel alignment has been proven to form strong self-supervisory signals for robust learning of dynamic processes [53]. Previous approaches performed similar temporal cropping operations [6,7,12].

4.3.3. Modality Replacement

To facilitate the training of a patient-invariant pain classification model, we randomly replace each modality from a regarded sample with the respective modality of other samples from the identical pain class. Therefore, the process of augmenting the i-th sample

X^{(i, k)} \in R^{T \times C}

(with C channels and T features) belonging to class k can be mathematically described as

{\tilde{X}}^{(i, k)} = {(1 - δ) x_{c}^{(i)} + δ x_{c}^{(y)} | x_{c}^{(y)} \sim X_{B}^{(k)}; δ \sim B (π)}_{c = 1}^{C}

(20)

where

δ \in {0; 1}

induces a modality replacement received from a Bernoulli distribution

B (π)

, and

x_{c}^{(\cdot)}

means the c-th channel of a sample from class k within the batch

X_{B}^{(k)}

. Throughout our experiments, we have chosen a fair coin flip probability (i.e.,

π = 0.5

) for substituting feature channels. As a beneficial side-effect, modality replacement artificially expands the small training corpus of BioVid with regard to individual patients. Note that the modality replacement operation was already used in prior work [52,57] to produce hard negative samples for multimodal contrastive learning. This is opposed to our modality replacement variant, which identifies the interchangeability of sensory measurements within the same pain class as a substantial training target to engender unbiased representation learning.

4.3.4. Modality Dropout

Finally, we establish the task-specific data augmentation of modality dropout [52], which randomly disables single modalities according to

\tilde{X} = X (I ⊙ d), d = {d_{c} \sim B (π)}_{c = 1}^{C},

(21)

where a sample

X \in R^{T \times C}

with C channels and T temporal features is transformed using the column-wise product of the identity matrix

I_{C \times C}

and the dropout vector

d

. The elements of

d

are drawn from a Bernoulli distribution

B (π)

. Again, we use an equal probability (i.e.,

π = 0.5

) for switching off individual feature channels. In the extreme case of eliminating all modalities by chance, we simply turn on all channels instead. The motivation behind modality dropout is three-fold: Firstly, the model is enforced to focus on each modality for preventing the known problem of overestimating single feature channels [52] and promoting a more contextualized information fusion. Especially in the case of BioVid’s biophysiological recordings, empirical evidence suggests that the EDA measurement appears to be significantly more informative than the ECG and EMG modalities [5,6,9,10,11], which could narrow a model’s perspective. Secondly, in the practical application of pain assessment, an invariant, or at least a resilient, model prediction is highly desirable to compensate for partial outages of sensors. Thirdly, this data augmentation strategy promotes efficient model extensibility by means of connecting auxiliary sensors with reduced training cost, due to the reuse of frozen autoencoders which have already learned to align their encoded representations with the associated pain class. Expanding the scope of modalities for an object class can particularly strengthen representation learning [51,53].

4.4. Network Components

The generic MM-SCAAE pretraining model in Figure 1 requires the specification of all network components for the task of pain classification. Hence, we provide in Figure 2 the implementation details on the variational encoder, the decoder and the late fusion network. Since previous works [13,14] on pain classification using the EDA signal from BioVid empirically validated the effectiveness of MSCNN [15,16] for temporal feature extraction, we incorporate an analogous MSCNN into our variational encoder. In particular, we follow the basic MSCNN structure defined in [15] of parallel convolutional layers with a subsequent fully connected projection head for harmonized feature concatenation. Consistent with [13,14], we establish three scales by applying different-sized convolutional kernels on the raw sensory time series. However, we expand the smallest time window from 0.1 s to 0.5 s in order to better account for the latency in biophysiological reactions and to significantly reduce computational efforts. In Figure 2, we state temporal convolutional layers (Conv1D)s with the used activation function, the number of filters and the strides (as downsampling (/x) or upsampling (*x) factor). The first downsampling factor c of the MSCNN is

10 %

of the window size. Apart from the first Conv1D, all convolutional layers within the encoder have a kernel size of 10. A fully connected layer (Dense) is characterized by its activation function and its neuron count. The application of the Gaussian Error Linear Unit (GELU) [59] activation function within the encoder can induce positive effects on the model’s generalizability [13]. The encoder emits per feature dimension a Gaussian distribution centered around the latent representation by predicting the distributions’ mean

μ

and variance

σ^{2}

.

Figure 2. Implementation of MM-SCAAE’s model components for the task of multi-sensor pain classification. (a) Variational encoder component of each unimodal autoencoder. The encoder extracts features of varying granularity with window size

w \in {0.5, 1, 2}

in seconds using a Multi-Scale Convolutional Neural Network (MSCNN) and produces parameter estimates for a local Gaussian distribution. (b) Denoising decoder component of each unimodal autoencoder. The decoder receives a stochastically sampled latent representation drawn from the local Gaussian distribution predicted by the encoder, and reconstructs the uncorrupt input. (c) The late fusion network aggregates the multimodal information into a global object representation via two sequential self-attention transformer layers. Within each subfigure, identically colored network layers share the same layer type.

The decoder network draws a sample from the Gaussian and reconstructs the original input. The decoder is composed of an MLP and a subsequent transposed convolutional network (cf. [17]) with kernel sizes

{10, 10, 10, 1}

. The late fusion network at first creates distinct views of the modality encodings (with kernel size of 5), and then applies two sequentially appended self-attention transformer layers [18] (with three multi-head attention heads) to determine a global object representation. Finally, the global object vector is normalized to a magnitude in range of

[0.1, 1]

via the squash function from Equation (9). Except for the output layers, both the decoder and the fusion network conduct the Parametric Rectified Linear Unit (PReLU) [60] activation function with linear initialization. Rectified linear units foster sparsity in neural activity, which may mitigate overfitting problems [16]. Moreover, the adaptivity of PReLU, in combination with its linear initialization, generally accelerates solution space exploration and resulting training progress [29]. The simple architecture of each adversarial network consists of three Dense layers with

{256, 128, 1}

neurons per layer. Both hidden layers apply PReLU, while the output layer uses the logistic sigmoid. After pretraining the MM-SCAAE model, the DVAEs and the late fusion network are extracted and frozen in order to train a classifier network on top of the global object representations. The classifier has a 128-unit Dense layer with PReLU activation and a two-unit Dense output layer with softmax activation for pain class prediction.

4.5. Global Setup

We implemented our experiments in the programming language Python [61] (version: 3.12.8) using the machine learning library Keras [62] (version: 3.4.1), with Tensorflow [63] (version: 2.17.0) as backend. If parameters are not explicitly defined below, Tensorflow’s default parameter setting is used. Apart from output layers, each network layer integrates Batch Normalization [64] before applying the activation function. In addition, each layer conducts Dropout [33] with a probability of

0.2

on incoming signals. As a gradient descent optimizer for all training procedures, we utilize AdamW [65], which equips the classic Adam [66] algorithm with proper weight decay. In addition, AdamW is initialized with a learning rate of

0.001

, and the AMSGrad [67] option is activated for enhancing training convergence [29].

The further training configuration comprises a batch size of 128 and training epochs of

{50, 10}

for pretraining and classifier optimization, respectively. As a prior distribution type for adversarial regularization, a zero-mean Gaussian with adaptive variance is used (see Section 2.2). Each Gaussian is initialized as standard normal distribution (i.e., with variance

σ^{2} = 1.0

). The adaptive variance per adversarial regularizer is repeatedly estimated by calculating the empirical variance over the respective batch of latent representations during the forward propagation in each training step. We empirically choose a momentum factor of

\frac{1}{τ} = 0.01

in Equation (4) to ensure a relatively stable variance estimation for proper optimization. This cumulative variance estimation allows for the full description of the Gaussian prior distribution for adversarial loss computation. For balancing adversarial training, especially at the beginning, we add noise

\sim N (0, 0.05)

to the target logits of the adversarial networks. Following [42,43], we use a temperature value of

τ = 0.1

in each exponential expression, i.e.,

exp (\cdot / τ)

, of the SupCon losses. The reconstruction loss of the autoencoder model is implemented as Mean Squared Error (MSE). Although we equally weight all pretraining objectives, we observed in preliminary tests that MSE rapidly diminishes after a few epochs, leading to a prevalence of latent regularization, similarly to

β

-VAE. To control for the uncertainty in pain classes caused by the measurement of only a small fraction of biophysiological reactions, we include label smoothing [68] with a factor of

0.3

for non-target classes during classifier training with the cross entropy loss. In Appendix E, we additionally provide a structured list to the global hyperparameters for quick reference.

4.6. Experiments

Within the results of our experiments, model performances are mainly evaluated by means of classification accuracy to ensure a valid comparability to existing pain-recognition methods in the literature. This is necessary since the preferable classification metrics of sensitivity and specificity for evaluating a classifier’s quality in a sensitive domain such as medicine are to date rarely reported in terms of the BioVid dataset. However, we state in our first experiment the relevant classification metrics for appropriately categorizing the resulting performance of our proposed model in the application context of medicine.

4.6.1. Classification Performance

In our first experiment, we aim to evaluate the basic classification performance of our MM-SCAAE pretraining framework for the task of pain recognition. More precisely, we formulate four two-class problems where the no-pain baseline (

T_{0}

) needs to be separated from the different pain intensities (

T_{1}

,

T_{2}

,

T_{3}

,

T_{4}

). For this purpose, we consider for pretraining and subsequent classifier training solely the two selected classes per task (i.e.,

T_{0}

vs.

T_{i}

). In these two-class scenarios, the MM-SCAAE model employs the nominal

ϵ

–SupInfoNCE loss in Equation (11), with

ϵ = 1.0

as the margin hyperparameter. Table 1 summarizes the performance statistics for the LOSO cross validation per downstream task. Additionally, Figure 3 displays the corresponding Receiver-Operating Characteristic (ROC) curve with the Area Under the Curve (AUC), and the LOSO accuracy distribution per task. In Table 1, we list the classification metrics typically reported in the pain recognition literature in conjunction with BioVid. In particular, the sensitivity and specificity metrics take a special role in assessing a model’s quality for predicting the occurrence of a medical condition. As one would expect, the class separability improves with increasing pain intensity. Note that the final task (

T_{0}

vs.

T_{4}

) has the highest practical relevance for pain recognition, since the linear categorization of the intermediate pain intensities (

T_{1}

,

T_{2}

,

T_{3}

) lack a clear interpretation and medical justification. Our proposed model achieves competitive accuracies and macro-averaged

F_{1}

-scores for all tasks compared to the existing multimodal pain classification approaches [3,5,6,7,9,10,11,12]. Moreover, our ROC curve plot including AUC values is coherent with prior work [13] on unimodal classification based on the EDA signal. A detailed performance comparison will be provided in the following Section 4.6.4, which relates the capability of MM-SCAAE to previous models.

Table 1. Classification metrics with standard deviations for the LOSO cross validation using two-class pretraining. Except for Cohen’s

κ

, all metrics are stated in %. The best value per classification metric is stated in bold.

Figure 3. Classification performance evaluation per pain intensity (

T_{1}

,

T_{2}

,

T_{3}

,

T_{4}

) versus the no-pain baseline (

T_{0}

). (a) ROC curve and AUC values document the change in performance for various classification thresholds. (b) Violin plot for each LOSO accuracy distribution.

4.6.2. Modality Significance

This post-processing analysis investigates the contribution of each modality to final class prediction. Due to the modality dropout data augmentation technique, we obtain a prediction model resilient against partial sensor failure. Furthermore, modality dropout engenders the tracking of classification performance for all possible sensor combinations during inference. Thus, we evaluate the classification accuracy of the models from our previous experiment in terms of varying sensor availability, as illustrated in Table 2. A sensor can have a value of 0 or 1 meaning off or on, respectively. The results support the suggestion in [13,14] to focus on the EDA modality, since it delivers the strongest overall contribution to effective class separation in the unimodal cases. Nevertheless, our model emphasizes that unimodal evaluation is inferior for each classification task. It is interesting that the EMG modality entails positive or negative effects depending on the downstream task. For the most important task of no-pain baseline

T_{0}

versus highest pain

T_{4}

, each modality accomplishes a significant contribution to class prediction. In particular, we observe that involving only one auxiliary modality to EDA seems to rather degrade accuracy, whereas all three sensors together achieve the highest accuracy for the (

T_{0}

vs.

T_{4}

) task. These observations indicate that the global object representation space formed by the SupCon loss captures nonlinear relationships between the unimodal latent spaces. Finally, the successful composition of a global object space, despite the modality replacement operation, empirically proves effectual patient-invariant learning.

Table 2. Classification accuracy (in %) with standard deviation within the LOSO cross validation for varying sensor availability. Sensor values of 0 or 1 signify sensor state of deactivation or activation, respectively. The best accuracy value per classification task is stated in bold.

4.6.3. Ordinal-Aware Multi-Class Pretraining

The autoencoder architecture of our model allows for multi-class pretraining. Furthermore, the ordinal SupCon loss formulation enables the injection of domain-specific knowledge via predefined

ϵ

-margin function into the MM-SCAAE pretraining model. Thus, we analyze in our next experiment if the exploitation of further training data from other classes can improve the two-class prediction accuracy. The central motivation behind this proceeding results from the chronic deficit of labeled training data in health care. Specifically, if a pretraining model is able to access information from instances associated with intermediate or surrounding classes, learned representation spaces may profit from smoother transitions for downstream tasks. To test this hypothesis, we conduct multi-class pretraining with the MM-SCAAE model using the classes

C_{1} = {T_{0}, T_{2}, T_{4}}

and

C_{2} = {T_{0}, T_{1}, T_{2}, T_{3}, T_{4}}

. For both class sets we additionally vary the use of nominal and ordinal SupCon loss. The margin hyperparameter for the nominal SupCon loss is chosen as

ϵ = (| C_{i} {| - 1)}^{- 1}

, whereas the ordinal SupCon loss utilizes the margin function

f_{ϵ} (r_{i j}) = r_{i j} \cdot (| C_{i} {| - 1)}^{- 2}

with

r_{i j}

as label difference between the classes

T_{i}

and

T_{j}

. The square exponent within the

ϵ

-margin function helps to balance the magnitudes between the ordinal and nominal SupCon losses. Table 3 illustrates the experimental results for the distinct multi-class pretraining scenarios. Based on the presented results, we can make two observations. Firstly, involving other classes than the selected classes for the downstream task generally lowers model accuracy (except for the task

T_{0}

vs.

T_{2}

). This circumstance can be seen by comparing the accuracies in Table 3 with the former experimental results in Table 1. In addition, accuracy values further reduce from multi-class configuration

C_{1}

to

C_{2}

. Secondly, the ordinal SupCon loss at least slightly degrades model performance compared to the nominal loss version in the most cases. We conclude that each class from BioVid contains a significant noise ratio which accumulates with the inclusion of further classes into the pretraining procedure. Since the multi-class pretraining models gain no benefit from using the ordinal SupCon loss, it implies that there may not exist a gradually linear change in the original representation space of biophysiological pain reactions.

Table 3. Classification accuracy (in %) with standard deviation within the LOSO cross validation using distinct multi-class pretraining scenarios. The best accuracy value per classification task is stated in bold.

4.6.4. Performance Comparison

Table 4 relates the performance of MM-SCAAE on the BioVid dataset to the existing methods. The two separated sections display unimodal approaches and multimodal methods, respectively. Each section is chronologically sorted, starting with the earliest approach. For ensuring valid comparability to the performance of previous pain recognition methods, we only list models which were evaluated by means of the LOSO cross validation. Furthermore, in our experiments we used the standard segmentation scheme of 5.5 s per sample from BioVid [3]. It is important to note that methods marked with (*) cannot be directly compared with the other performance results because of a diverging sample segment size of 4.5 s (example shown in [8]). However, we included these methods for the sake of completeness. The best LOSO cross validation performance is individually emphasized in bold for unimodal and multimodal models. In addition, the best results per section for the model group with diverging segmentation scheme (*) are underlined. In the method descriptions, HCFs stands for HandCrafted Features and CA means Cross-Attention.

Table 4. LOSO cross validation performance comparison of pain recognition models on the BioVid heat pain database using classification accuracy and

F_{1}

-Score (in %). Values are stated in the format {Accuracy/

F_{1}

-Score}. The best values are individually emphasized in bold for unimodal and multimodal models per classification task. Additionally, the best values per section for the model group with diverging segmentation scheme (*) are underlined. The (✓) symbol signifies the use of a certain modality for the task of pain recognition.

In accordance with our former experimental results about modality significance (see Table 2), we observe that the biopotential EDA mainly contributes to the high-quality predictions in previous pain recognition models. Therefore, both Transformer Encoder models which are highly specialized on the EDA modality greatly outperform the other unimodal approaches. Regarding the multimodal methods with standard segmentation, our designed MM-SCAAE model constitutes the new state-of-the-art performance for all binary classification tasks. Moreover, MM-SCAAE’s accuracy is also competitive with the models with diverging segmentation scheme (*), and MM-SCAAE’s

F_{1}

-score outperforms the formerly reported best value by a significant margin of over two percentage points. Note that the

F_{1}

-score is typically more expressive and relevant than accuracy in the medical context.

At first glance, our model appears inferior to the unimodal transformer encoders considering both accuracy and

F_{1}

-scores for all four pain recognition tasks. However, we implemented our MM-SCAAE framework in the experiments with identical autoencoder architectures for arbitrary sensor data in order to satisfy the trade-off between performance and computational effort. In principle, MM-SCAAE offers the capacity to utilize highly specialized feature extractors, such as the EDA-transformer encoder, for certain modalities. Thus, we proposed a universal information fusion architecture for the multi-sensor classification domain of time-series data. In particular, incorporating information from multiple and heterogenous sources is essential for trustworthy decision-making in health care.

Since most works on pain recognition using the BioVid dataset report their classifier quality solely based on classification accuracy, we primarily had to rely on the accuracy metric to ensure valid comparability to existing models in the literature. Although accuracy appears to be a sufficient indicator to compare the performance of different classification models, its expressiveness to evaluate a classifier’s quality in a sensitive domain such as medicine is strongly limited. To account for this circumstance, we additionally state in Table 5 the sensitivity and specificity classification metrics of our model in relation to the state-of-the-art unimodal pain recognition approaches. For the most relevant task of no-pain (

T_{0}

) versus highest pain (

T_{4}

), our model achieves a sensitivity on par with the highest-reported value. Directly compared with the Transformer Encoder model, our classifier produces a sound balance between sensitivity and specificity, again reflecting the potency of our MM-SCAAE pretraining framework.

Table 5. LOSO cross validation performance comparison of pain recognition models on the BioVid heat pain database using the sensitivity and specificity classification metrics (in %). Values are stated in the format {Sensitivity/Specificity}. The best metric value per classification task is stated in bold.

4.6.5. Grouped-Prediction Estimate

Despite the above performance benchmarks in terms of single-sample accuracy, at this point, it is not clear if current pain recognition models are adequately mature for practical application. In more detail, there exist crucial limitations on the data collection process for pain classification databases, driven by cost and ethical considerations, as well as hard assumptions for bodily reactions on artificially induced pain intensities. A major shortcoming arises from the punctual evaluation of recorded biophysiological signals for small time frames of a few seconds, since reasonable medical diagnosis is usually substantiated through longer observation of permanent pain or recurrent pain episodes in order to select an appropriate treatment. This certainly does not mean that in practical use cases pain events cannot emerge in short-time periods, e.g., of a few seconds, but their reliable characterization must depend on recurrent sensory measurements (cf. [1]). Moreover, we expect natural variation in both pain intensities and biophysiological reactions of different pain episodes—even if originating from a common cause. Such variation could produce inconsistencies in pain categorization for discrete moments. Hence, we examine in the subsequent post-processing analysis how the LOSO classification accuracy of our model from the first experiment is affected by the natural variation in individual pain perception per patient. For this purpose, we estimate the occurrence probability

P (c | p)

of the pain class c for a regarded patient p as the expectation over grouped predictions obtained via Monte Carlo (MC) sampling:

P (c | p) \approx E_{G \sim p_{data}} E_{G} [z_{θ} (c | X)] = \frac{1}{M K} \sum_{m = 1}^{M} \sum_{k = 1}^{K} z_{θ} (c | X_{k}^{(m)}) .

(22)

Here, a group is defined as

G = {X_{k} \in R^{D \times C}}_{k = 1}^{K}

, which comprises K distinct data points

X_{k}

from patient p with D feature dimensions and C channels, drawn from the underlying data distribution

p_{data}

. The composite prediction

z_{θ} (c)

, emitted from our model with learned parameters

θ

, is determined by marginalization over the random group

G

to provide a more stable probability estimate that reduces variational effects. Intuitively, this formulation of grouped predictions can also be interpreted as a Bayesian model averaging [69,70] strategy based on an equal-weighted mixture of model posteriors. For further explanation, refer to Appendix F.

Figure 4 displays the impact of the group size on the resulting two-class (

T_{0}

vs.

T_{4}

) LOSO accuracy distribution, received via MC sampling of random groups over 100 iterations. With a growth in group size, the mean LOSO accuracy at first strongly improves and, finally, starts to saturate around a group size of 10 samples. The gray-shaded region spans the gain potential for LOSO accuracy from the deterministic evaluation over all samples in the test set (dotted red line) to the grouped prediction over all test samples (dotted blue line). This leads to a maximal accuracy value of ≈95% in total, which is an improvement of over ten percentage points compared to the deterministic sample-wise evaluation. In particular, the grouped-prediction estimate maximally extends the observation period of 5.5 s per sample to 110 s for the full group size of twenty samples per pain class and patient. Based on these considerations, we argue that current pain recognition models are indeed capable of providing additional guidance for medical pain assessment. Such continual assistance systems may help health personnel to make decisions about appropriate medication more quickly and help to prevent possible drug abuse.

Figure 4. Growth in two-class (

T_{0}

vs.

T_{4}

) LOSO accuracy over grouped predictions per patient, with an increasing number of samples per group. Each box plot visualizes the LOSO accuracy distribution for the respective group size, which is approximated through MC sampling of random groups over 100 iterations. The accuracy gain potential (gray-shaded area) signifies the performance discrepancy between deterministic single-sample classification (red-dotted line) and pain class prediction grouped over all test instances (blue-dotted line).

5. Discussion

5.1. Perspectives on Pain Assessment

Our proposed MM-SCAAE framework opens new perspectives on the reliability of automated pain recognition systems through the integration of multiple biopotentials into the decision-making process, and by proving state-of-the-art performance on multimodal pain recognition using the BioVid dataset. Nevertheless, the current superiority of unimodal models [13,14] with highly specialized network architectures based on the EDA signal points to the major limitation of multimodal approaches, which lies in their proportional scaling to the number of involved sensors. This means that despite the architectural flexibility of MM-SCAAE to subsume unimodal feature extractors, there remains a strict trade-off between model capability and computational requirements.

Regarding the practical realization of pain assessment systems, we believe that multimodal approaches are crucial for trustworthiness and acceptance in health care. Individual sensor contribution may vary, particularly in clinical environments in the real world, as a result of patients’ conditions, medical treatments or external factors. Hence, we propose our MM-SCAAE model for future research attempts to broaden the inclusion of other biopotentials into automated pain assessment. The demonstrated gain in pain recognition accuracy by means of grouped-prediction estimates stemming from multiple samples of length 5.5 s also reveals a potential mismatch between dataset benchmarking based on short-term observations and evaluation for practical applicability. More precisely, we argue that an approximate pain recognition performance of 95% for an extended observation window of around 110 s is still acceptable for the vast majority of clinical applications, since patients would instead need to articulate their subjective pain perception for proper diagnosis. In cases of non-communicative patients (e.g., infants, mentally impaired or unconscious persons), the performance gain established through the grouped-prediction estimate may significantly improve the quality of pain diagnosis. Thus, we are convinced that the collection of real-word datasets with enlarged observation periods is the key to closing the gap between pain research and its practical application.

5.2. Generic Multi-Sensor Pretraining for Time-Series Classification

The pain recognition experiments demonstrated that a key benefit of our pretraining model is that it refrains from comprehensive signal preprocessing in terms of feature engineering (cf. [3,5,9,10,11]) and dimensionality reduction (cf. [6,7,8,12]). In particular, MM-SCAAE directly applies a kind of dimensionality reduction on the raw input signals by means of the introductory MSCNN, which was inspired by the state-of-the-art works [13,14] on unimodal pain recognition. Moreover, our model merely needs half of the pretraining epochs of former methods [6,12,13,14]. These circumstances may serve as indicators for the general capacity of MM-SCAAE.

Apart from the concrete use case of pain recognition, we declare MM-SCAAE as generic model for multi-sensor pretraining which can be tailored by implementing application-specific autoencoder components. Ideally, we would plug in as many sensors as available to maximize the confidence in model predictions, and the model should apply a relevance-weighted information fusion. To make this scenario feasible, MM-SCAAE delivers a rather lightweight architecture which linearly scales with the sensor count. Specifically, the enrichment of AAEs with a SupCon objective also constitutes a generic model type, forming an interesting future research direction.

5.3. Supervised Contrastive Adversarial AutoEncoder (SCAAE)

Adversarial regularization is superior for distribution discrimination compared to the trivial KL-divergence regularizer of VAEs, since adversarial regularization cannot be easily tricked (see Section 2.2.2), and its computational effort is negligible for small two-layer MLPs like we used in the MM-SCAAE framework. During our experiments, we especially did not encounter training instabilities, as is known from the generally error-prone GAN [35] training procedure. As a unique characteristic, the integration of the SupCon objective enhances the primary SAAE [34] framework through class partitioning of latent spaces. For further information about the performance of SCAAE, we refer to Appendix D, where we provide an empirical verification for its effectiveness and generative capabilities. Due to the restricted scope of this paper, we did not explore unsupervised or semi-supervised contrastive AAEs. However, we hypothesize that—even without label information—diverse representation learning algorithms may benefit from auxiliary contrastive regularizations. We hope that future research efforts will evolve the theoretical framework of contrastive AAEs.

5.4. Limitations of This Study and Future Work

The BioVid dataset focuses on short-term measurements of biophysiological pain reactions with a length of 5.5 s. We hypothesize that longer observations of certain modalities could change their individual contribution. In particular, we are convinced that a longer observation of the heart rate for better capturing its variability should facilitate pain classification. Hence, we identify the need for collecting new datasets which account for the latency in the pain reaction of certain modalities. Moreover, we empirically verified the performance of our proposed approach with experiments on the heat pain dataset BioVid, because it is publicly available and its usage is widely spread. However, it could be an interesting future research perspective to investigate the performance of our MM-SCAAE pretraining framework on pain recognition datasets with diverse pain reasons (e.g., electric shock or tactile pressure). Nevertheless, only the gathering of real-world datasets with distinctive pain reasons will eventually allow for the reliable evaluation of state-of-the-art pain recognition models for practical applicability.

Although our model empirically proved resilience against partial sensor failure, our experiments on modality significance emphasized that performance can still dramatically degenerate if highly contributing sensors are unavailable. However, such degeneration effects should constantly reduce with the inclusion of further sensors. Thus, a promising direction for future work is the exploration of other sensors in the clinical setting (e.g., for respiration or blood pressure) to assure trustworthy pain recognition. Our experimental results demonstrated the generalizability of the proposed MM-SCAAE framework in the challenging domain of pain recognition, where datasets are typically scarce and highly biased through the use of small subject groups. In the case of BioVid, for instance, the whole dataset stems from measurements of merely 87 patients. Moreover, we restrict the application scope of the designed MM-SCAAE pretraining model to multimodal time-series classification. Since we strictly focus in our study on the task of pain recognition, we encourage future research efforts that investigate the general capacity of the MM-SCAAE framework and its sensitivity to parameter settings for a wide spectrum of time-series classification tasks. Due to the introduction of an ordinal SupCon loss, it would be especially interesting to exploit inter-class relationships for improving performance in ordinal classification tasks.

The core contribution of our study concerns the technical realization of automated pain recognition systems based on the BioVid heat pain database, which originates from a controlled experiment with healthy probands and standardized environmental factors. We briefly mentioned in our work that in realistic clinical settings patients suffer from diseases, receive medication and may be exposed to external factors, which may drastically change pain-related body reactions. Nevertheless, we did not provide a structural guidance for the hard requirements on pain recognition systems that may arise in a real-world setting. More precisely, our technically oriented study did neither examine the influence of interference factors (e.g., mental conditions, diseases or medication) nor cover the aspect of patient diversity (e.g., ethnic or cultural background) on the resulting performance of pain recognition models. It is necessary to address all these aspects in future research attempts to promote trustworthiness in automated pain recognition. Due to the intrusion of such systems into the subjective perception of individuals, practical applications must ultimately be accompanied with medical ethics and regulatory compliance to prevent the potential misuse of this sensitive information.

6. Conclusions

In this paper, we proposed the multi-sensor pretraining framework MM-SCAAE for addressing the task of multimodal pain recognition based on three selected biophysiological signals. MM-SCAAE combines simultaneous autoencoder training with adversarial and supervised contrastive regularization to ensure robust representation learning. We designed task-specific implementations for all MM-SCAAE components and conducted comprehensive experiments on the BioVid dataset to prove the model’s capability for automated pain recognition. In particular, we achieved a new state-of-the-art in terms of accuracy and

F_{1}

-score for the multimodal case. Our convincing results should steer researchers’ attention away from unimodal methods and back to multimodal approaches for improving the trustworthiness of pain recognition systems. Finally, we showed that grouped-prediction approaches have the potential to significantly boost the quality of pain assessment in practice.

Author Contributions

Conceptualization, N.A.K.S. and F.S.; methodology, N.A.K.S.; software, N.A.K.S.; validation, N.A.K.S. and F.S.; formal analysis, N.A.K.S.; investigation, N.A.K.S.; resources, F.S.; data curation, N.A.K.S. and F.S.; writing—original draft, N.A.K.S.; writing—review and editing, N.A.K.S. and F.S.; visualization, N.A.K.S.; supervision, F.S.; project administration, N.A.K.S. and F.S.; funding acquisition, F.S. All authors have read and agreed to the published version of the manuscript.

Funding

The work of Friedhelm Schwenker was supported by the German Research Foundation (DFG) under Grant SCHW 623/7-1.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available. The heat pain database BioVid is accessible at https://www.nit.ovgu.de/BioVid.html (accessed on 14 October 2025). The handwritten digit dataset MNIST is accessible at [71]. Source code is available on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Variational Lower Bound and Denoising Criterion

Let

p_{θ} (X, Z)

be the joint distribution between the observations X and their latent variables Z, and

p_{θ}

the autoencoder model distribution with parameters

θ = {ϕ, ψ}

, where

ϕ

and

ψ

refer to the encoder

q_{ϕ} (Z | X)

and the decoder

p_{ψ} (X | Z)

components, respectively. Inequality relationships in the subsequent proofs are derived from the Jensen’s inequality. At first, we provide a simple proof to the Variational Lower Bound (VLB) in Equation (2).

Proof.

\begin{matrix} VLB & = E_{X} E_{q_{ϕ} (Z | X)} [log \frac{p_{θ} (X, Z)}{q_{ϕ} (Z | X)}] \\ \leq E_{X} [log (E_{q_{ϕ} (Z | X)} [\frac{p_{θ} (X, Z)}{q_{ϕ} (Z | X)}])] \\ = E_{X} [log (\int p_{θ} (X, Z) \frac{q_{ϕ} (Z | X)}{q_{ϕ} (Z | X)} d Z)] \\ = E_{X} [log p_{θ} (X)] \end{matrix}

□

Next, we show that the VLB can be directly translated into a reconstruction objective

(- L_{Recon .})

with a regularization of prior distribution alignment

(- R_{Prior})

, as stated in Equation (3). For this purpose, we denote the Shannon entropy as

H (p (X)) = E_{p (X)} [- log p (X)]

and the cross entropy as

H (p (X), q (X)) = E_{p (X)} [- log q (X)]

. Further note that the Kullback–Leibler divergence is determined via

D_{KL} (p (X) | | q (X)) = H (p (X), q (X)) - H (p (X))

.

Proof.

\begin{matrix} VLB & = E_{X} E_{q_{ϕ} (Z | X)} [log \frac{p_{θ} (X, Z)}{q_{ϕ} (Z | X)}] \\ = E_{X} E_{q_{ϕ} (Z | X)} [log (p_{ψ} (X | Z) q (Z)) - log q_{ϕ} (Z | X)] \\ = E_{X} [- L_{Recon .} + E_{q_{ϕ} (Z | X)} [log q (Z) - log q_{ϕ} (Z | X)]] \\ = E_{X} [- L_{Recon .} - H (q_{ϕ} (Z | X), q (Z)) + H (q_{ϕ} (Z | X)] \\ = E_{X} [- L_{Recon .} - D_{KL} (q_{ϕ} (Z | X) | | q (Z))] \\ = E_{X} [- L_{Recon .} - R_{Prior}] \end{matrix}

□

Evidently, the VLB enforces the construction of latent representation manifolds aligned with the prior distribution

q (Z)

, and maximizes space exploration through the demanded gain in the approximated posterior entropy. Involving a denoising criterion in the variational autoencoder framework, where stochastically corrupted input is consumed by the encoder and the decoder has to recover the original input leads to an analogous Denoising Variational Lower Bound (DVLB). Following [32], the encoder

{\tilde{q}}_{ϕ} (Z | X) = \int q_{ϕ} (Z | \tilde{X}) p_{π} (\tilde{X} | X) d \tilde{X}

needs to integrate over the noisy inputs using a predefined corruption distribution

p_{π} (\tilde{X} | X)

.

Proof.

\begin{matrix} DVLB & = E_{X} E_{{\tilde{q}}_{ϕ} (Z | X)} [log \frac{p_{θ} (X, Z)}{q_{ϕ} (Z | \tilde{X})}] \\ \leq E_{X} [log (E_{{\tilde{q}}_{ϕ} (Z | X)} [\frac{p_{θ} (X, Z)}{q_{ϕ} (Z | \tilde{X})}])] \\ = E_{X} [log (\int \int p_{θ} (X, Z) \frac{{\tilde{q}}_{ϕ} (Z | X)}{{\tilde{q}}_{ϕ} (Z | X)} d \tilde{X} d Z)] \\ = E_{X} [log p_{θ} (X)] \end{matrix}

□

We can identify the denoising reconstruction ability (

- L_{Recon .}^{'}

) and the encoding space regularization (

- R_{Prior}^{'}

) with corrupted inputs as the updated autoencoder objectives.

Proof.

\begin{matrix} DVLB & = E_{X} E_{{\tilde{q}}_{ϕ} (Z | X)} [log \frac{p_{θ} (X, Z)}{q_{ϕ} (Z | \tilde{X})}] \\ = E_{X} E_{{\tilde{q}}_{ϕ} (Z | X)} [log (p_{ψ} (X | Z) q (Z)) - log q_{ϕ} (Z | \tilde{X})] \\ = E_{X} [- L_{Recon .}^{'} + E_{{\tilde{q}}_{ϕ} (Z | X)} [log \frac{q (Z)}{q_{ϕ} (Z | \tilde{X})}]] \\ = E_{X} [- L_{Recon .}^{'} - D_{KL} (q_{ϕ} (Z | \tilde{X}) | | q (Z))] \\ = E_{X} [- L_{Recon .}^{'} - R_{Prior}^{'}] \end{matrix}

□

The DVLB delivers more robust loss functions to small input perturbations compared to a plain variational autoencoder, with the precondition that an appropriate corruption distribution is chosen [32]. As mentioned in [31], it can easily be shown that the DVLB is equivalent to maximizing the mutual information

I (X, Z) = H (Z) - H (Z | X)

between original data instances X and the latent variables Z. Evidently, this analogously holds for the VLB.

Proof.

\begin{matrix} DVLB & = E_{X} E_{{\tilde{q}}_{ϕ} (Z | X)} [log \frac{p_{θ} (X, Z)}{q_{ϕ} (Z | \tilde{X})}] \\ = E_{X} E_{{\tilde{q}}_{ϕ} (Z | X)} [- log q_{ϕ} (Z | \tilde{X}) + log p_{θ} (X, Z)] \\ = E_{X} [H (\tilde{q} (Z | X))] + E_{X} E_{{\tilde{q}}_{ϕ} (Z | X)} [log p_{θ} (X, Z) \frac{p (X)}{p (X)}] \\ = H (Z) + E_{p_{θ} (X, Z)} [log \frac{p_{θ} (X, Z)}{p (X)}] + E_{X} [log p (X)] \\ = H (Z) - H (Z | X) - H (X) \\ = I (X, Z) + const . \end{matrix}

□

Appendix B. Confidence-Aware Similarity Measure

The inner product between two vectors

\hat{x}

and

\hat{y}

can be decomposed into

⟨\hat{x}, \hat{y}⟩ = ⟨u (\hat{x}), u (\hat{y})⟩ | | \hat{x} | | \cdot | | \hat{y} | |,

(A1)

where

u (\cdot)

means the respective unit vector. The scalar product based on both unit vectors corresponds to the cosine-similarity, while the two final coefficients refer to the prediction confidence in the respective latent representations. Table A1 displays the influence of model prediction confidence on the proposed similarity measure for contrastive learning in Equation (10).

Table A1. Impact of model confidence in predictions on the inner product of distributed representations.

	$⟨u (\hat{x}), u (\hat{y})⟩$	Cosine Similarity
$\| \| \hat{x} \| \| \cdot \| \| \hat{y} \| \|$		Low	High
Prediction Confidence	low	0	$\| \| \hat{x} \| \| \cdot \| \| \hat{y} \| \|$
Prediction Confidence	high	$⟨u (\hat{x}), u (\hat{y})⟩$	1

If the vector orientations strongly differ and the model is very unsure about the consistency of one or both encoded representations, the resulting similarity value approaches zero. If the model declares a high confidence in both latent representations, then we receive a normalized value (i.e.,

\in [0, 1]

) of the cosine-similarity. The most interesting case occurs when a single low-confidence prediction significantly decreases the similarity value between two considered instances. This implies a reduction in the impact of false negatives and false positives on the contrastive loss caused by samples which are recognized as outliers. In consequence, the model profits from a more dedicated optimization and an improved generalizability (depending on the amount of outliers in the dataset).

Appendix C. Ordinality-Aligned ϵ–SupInfoNCE Derivation

Following [43], we can derive the loss

R_{Con .}^{Ord .}

in Equation (15) based on our defined optimization constraint in Equation (13) (i.e., with sensitivity to class ordinality) by approximating the error function using the LogSumExp expression:

Proof.

\begin{matrix} L & = \frac{1}{| P |} \sum_{i} max (0, {s_{j}^{(n)} - s_{i}^{(p)} + f_{ϵ} (r_{n p})}_{j, n \neq p}) \\ \approx \frac{1}{| P |} \sum_{i} log (1 + \sum_{j} exp (s_{j}^{(n)} - s_{i}^{(p)} + f_{ϵ} (r_{n p}))) \\ = - \frac{1}{| P |} \sum_{i} log (\frac{exp (s_{i}^{(p)})}{exp (s_{i}^{(p)}) + \sum_{j} exp (s_{j}^{(n)} + f_{ϵ} (r_{n p}))}) \\ = R_{Con .}^{Ord .} \end{matrix}

□

Appendix D. Robust Supervised Adversarial AutoEncoder (SAAE)

Our generalized Supervised Contrastive Adversarial AutoEncoder (SCAAE) framework extends traditional SAAE [34] training with class label injection, via reserved one-hot vector to the decoder network, by means of the following two modifications:

(1)

Label information is only implicitly provided to the generative model through backpropagation of the

ϵ

-SupInfoNCE losses in Equations (11) and (15), and

(2)

Discriminator capacity is reduced to the task complexity of distribution determination. Modification

(1)

uses a fusion network, which can also be interpreted as a projection network, which allows for the proper realization of the SupCon objective while preventing destructive distortions in the generative encoding space. Modification

(2)

stabilizes the error-prone GAN [35] training based on discriminator simplification, which satisfies distribution determination and balances the relation to the multi-objective generator.

Appendix D.1. Experimental Setup

To empirically verify the above statements, we conduct experiments on the handwritten digit dataset MNIST [37,38] (version used from TensorFlow Datasets [71]) for a qualitative comparison using an analog model architecture to [34]. Therefore, the encoder and decoder networks consist of two 1000-unit hidden layers with Rectified Linear Unit (ReLU) activation. The encoder ends with a dense layer of two units and linear activation, while the decoder comprises a final 784-unit layer with sigmoid activation. In contrast to [34], our deterministic discriminator network is simplified to a 256-unit hidden layer followed by a 128-unit hidden layer with ReLU activation. The discriminator’s output layer contains a single neuron with sigmoid activation. The fusion network utilizes the same structure as the discriminator but owns 32 output neurons with linear activation. The fusion network normalizes its final outcome using the squash function in Equation (9) (with

α = 0.1

) or via unit-vector conversion, depending on the performed experiment. A Batch Normalization [64] layer takes place in each fully connected layer before applying the activation function, except for output layers. The training configuration comprises a 2D Gaussian prior distribution with constant zero mean and standard deviation of 5, the addition of noise

\sim N (0, 0.05)

to the logits of the discriminator labels, a batch size of 128, a SupCon loss temperature of

τ = 0.1

, the application of an exclusive Adam [66] optimizer with AMSGrad [67] for the generative model and a second one for the discriminator, 100 training epochs, and a constant learning rate of

10^{- 4}

.

Figure A1. Generalized framework of Supervised Contrastive Adversarial AutoEncoder (SCAAE) trained on MNIST using a 2D Gaussian prior distribution with zero mean and a standard deviation of 5. Visualization of the learned 2D latent encoder space based on MNIST’s hold-out test set for (a,e) nominal and (b,f) ordinal

ϵ

–SupInfoNCE loss. (a,b) conduct squash normalization while (e,f) utilize unit-vector normalization in the last layer of the fusion network. (c,d) display the reconstructions within the feature range of

[- 10; 10]

(step size

= 2

) for the generative encoder spaces from (a,b), respectively.

Appendix D.2. Nominal ϵ–SupInfoNCE Loss

In the first experiment, the nominal

ϵ

–SupInfoNCE loss in Equation (11) is used where samples from all differing class-to-class pairs are treated equally. Since the fusion network restricts output vectors to magnitudes in the range of

[α, 1]

, we choose the margin parameter

ϵ = \frac{1}{n - 1}

(A2)

for n classes to obtain a proportional space exploitation. For MNIST with its ten digit classes, this means a desired minimal margin

ϵ \approx 0.1

between samples from two classes. Figure A1 illustrates the experimental results based on the learned latent encoder space for the squash function in (a) and for the unit-vector normalization in (e). In both cases, SCAAE ensures that the encoder space approaches the 2D Gaussian prior distribution. Specifically, the encoder spaces shape sections within the prior distribution that are clearly separated and exclusively associated with a single digit class. During optimization, the encoder has to make a trade-off between the competitive goals of generative ability and class encapsulation. The decoded latent representations in (c) for the encoding space (a) visualize the learned generative capability with smooth inter-class transitions. In practice, we found that during multiple runs, the squash variant tends to converge faster and to more convincing results, supporting our reasoning in Appendix B. Compared to traditional SAAE [34], our approach needs neither an a priori assignment of dedicated prior distributions to classes nor the direct bypassing of label information to the decoder network. Moreover, the success of bypassing label information has the central disadvantage of being dependent on the number of latent dimensions, i.e., with growing latent dimensionality, class information tends to be stored redundantly, which can harm class separation. Our SCAAE approach eliminates these shortcomings, leading to a more robust SAAE variant.

Appendix D.3. Ordinal ϵ–SupInfoNCE Loss

In the next experiment, the impact of the ordinal

ϵ

–SupInfoNCE loss in Equation (15) on the latent encoder space is investigated by enforcing an artificial class-ordinality constraint on MNIST. In general, we define a circulant graph consisting of the nominal class labels as nodes (i.e., class indices from 0 to

n - 1

). If we assume more than six classes, then the circulant class-to-class distance matrix can be described as

D_{n > 6} = [\begin{matrix} 0 & 1 & 2 & \dots & 2 & 1 \\ 1 & 0 & 1 & 2 & \dots & 2 \\ 2 & 1 & 0 & 1 & 3 \\ ⋮ & ⋮ & 1 & 0 & ⋱ & ⋮ \\ 2 & 3 & ⋱ & ⋱ & 1 \\ 1 & 2 & \dots & 2 & 1 & 0 \end{matrix}] .

(A3)

As unique characteristic, the circulant matrix repeats per row and column circular versions of the same vector

d_{n > 6} = {[\begin{matrix} 0 & 1 & 2 & \dots & m & \dots & 3 & 2 & 1 \end{matrix}]}^{T},

(A4)

respectively shifted by row or column index. The value (and index) of the mean element m with the longest distance can be determined with

m = \frac{n}{2}

, if the condition

n mod 2 = 0

is true. If n is an odd number then

m = \frac{n - 1}{2}

, and the m value has to be repeated one time before distances start to lower. The ten digit classes from MNIST constitute for the first class (with zero-index) the distance vector

d_{n = 10} = {[\begin{matrix} 0 & 1 & 2 & 3 & 4 & 5 & 4 & 3 & 2 & 1 \end{matrix}]}^{T} .

(A5)

According to Equation (14), we formulate the rank-based

ϵ

-margin function between the classes i and j as

f_{ϵ} (r_{i j}) = 0.1 \cdot d_{n = 10} (r_{i j}), r_{i j} = | i - j |,

(A6)

where the coefficient

0.1

represents the minimal margin between samples from different classes, and

d_{n = 10} (k)

resolves to the k-th vector element of

d_{n = 10}

. Note that the relation

d_{n = 10} (r_{i j}) = D_{n = 10} (i, j)

holds due to the access of vector elements by relative class distance as offset. Figure A1 displays the latent encoder spaces for the squash function in (b) and for the unit-vector normalization in (f). Again, both ordinal loss variants result in a successful approximation of the 2D Gaussian prior distribution, with clearly separated sections assigned to specific classes. In particular, the configuration with the squash function as the normalization strategy perfectly matches the artificially defined circular class-ordinality constraint. This circumstance can be further verified with the respective digit generations in (d), where digits from zero to nine are arranged counter-clockwise with smooth inter-class transformations. The latent encoding space of the unit-vector normalization in (f) reveals a deficit in conserving the predefined class ordinality. We hypothesize that this situation originates from a reduced encoding space exploration during training caused by the unit-vector conversion. Note that we applied both configurations a few runs and selected the best reached results, since the ordinality objective means a soft constraint as part of a multimodal optimization that can not ensure converging to the optimal solution each time. Nevertheless, we observed in total a significantly higher result quality and consistency utilizing the squash function. Thus, SCAAE provides the built-in capability to support ordinality-aligned classification through the definition of task-oriented

ϵ

-margin functions.

Appendix E. Summary of the Global Setup

In Table A2, we structure the hyperparameters from the global experimental setup in Section 4.5 by categorizing after hyperparameter types and providing for each parameter a short description. Hence, the interested reader obtains a quick reference on general implementation details. For the detailed description of the designed network components, we refer to Figure 2 in Section 4.4.

Table A2. Structured summary of the global hyperparameter settings for all experiments, fully stated in Section 4.5.

Type	Hyperparameter	Description
Architecture	Batch Normalization	Placed after each layer (except for output layers) befor applying the activation function.
Architecture	Dropout rate of $0.2$	Applied on the input signals of each layer.
Pre-/training	AdamW	Gradient descent optimizer with activated AMSGrad option.
	Learning rate of $0.001$	Passed as argument to AdamW.
	Batch size of 128	Number of random samples per optimizer iteration.
	Pretraining epochs of 50	For training the MM-SCAAE framework.
	Training epochs of 10	For training the classifier on top of the frozen autoencoders and the fusion network.
	Zero-mean Gaussian (initialized with $σ^{2} = 1.0$ )	Used prior distribution type for each adversarial regularization with adaptive parameter estimation.
	Momentum factor of $\frac{1}{τ} = 0.01$	Used as sensitivity parameter for adaptive variance updates, as stated in Equation (4).
Losses	Noise addition $\sim N (0, 0.05)$	To the target logits of each adversarial regularizer.
	SupCon loss temperature with value $τ = 0.1$	Value chosen as suggested in [42,43].
	MSE reconstruction loss	For optimizing modality-specific autoencoders.
	Cross entropy with label smoothing (factor of $0.3$ )	During classifier training to account for potential uncertainty in pain classes.

Appendix F. Bayesian Model Averaging via Prediction Grouping

According to the Law of Total Probability, we can calculate the marginal probability

P_{T} (c | p)

that patient p experiences pain of class c within the time span

t \in {1, . ., T}

of recorded biophysiological measurements as

P_{T} (c | p) = \sum_{t} z_{θ} (c | X_{t}) p_{data} (X_{t}) .

(A7)

Due to the continuous monitoring of possible pain reaction in the patient’s body, we have to properly approach this quantity using grouped predictions, leading to

P_{T} (c | p) \approx E_{G} [z_{θ} (c | X)] = \sum_{k = 1}^{K} z_{θ} (c | X_{k}) p_{data} (X_{k}),

(A8)

where a group is composed as set

G = {X_{k} \in R^{D \times C}}_{k = 1}^{K}

and

G \subseteq {X_{t}}_{t = 1}^{T}

. Since we have no direct access to the data-generating distribution

p_{data}

, we simply substitute the above expression with the proportional relationship

P_{G} (c | p) \propto \sum_{k = 1}^{K} z_{θ} (c | X_{k}) .

(A9)

To regain a valid probability value as composite prediction, we calculate the arithmetic mean over all group members by

P_{G} (c | p) \approx \frac{1}{K} \sum_{k = 1}^{K} z_{θ} (c | X_{k})

(A10)

which corresponds to the core component of our grouped prediction estimate in Equation (22). In a broader sense, Equation (A10) describes a Bayesian model averaging [69,70] strategy, where our predictor model

z_{θ} (c | \cdot)

, parameterized with

θ

, accesses a mixture of internal prediction distributions to generate a prediction value. If we assume that each internal prediction distribution captures distinct characteristics for a certain pain class and the magnitude of class-specific characteristics is fluctuating per considered measurement, then averaging over grouped measurements eliminates deceptive variations to shape a more robust predictor estimate. For the reason that we have neither knowledge about which internal prediction distribution is selected per sample nor information about the relevance per partial predictor, we must remain with the arithmetic group mean. However, Figure 4 reveals that for the baseline measurements

T_{0}

and the highest-intensity pain class

T_{4}

, natural variation can be effectively canceled out with a group size of above 10 samples.

References

Werner, P.; Al-Hamadi, A.; Niese, R.; Walter, S.; Gruss, S.; Traue, H.C. Towards Pain Monitoring: Facial Expression, Head Pose, a new Database, an Automatic System and Remaining Challenges. In Proceedings of the British Machine Vision Conference (BMVC), Bristol, UK, 9–13 September 2013; pp. 1–13. [Google Scholar]
Walter, S.; Gruss, S.; Ehleiter, H.; Tan, J.; Traue, H.C.; Werner, P.; Al-Hamadi, A.; Crawcour, S.; Andrade, A.O.; Moreira da Silva, G. The BioVid Heat Pain Database: Data for the Advancement and Systematic Validation of an Automated Pain Recognition System. In Proceedings of the 2013 IEEE International Conference on Cybernetics (CYBCO), Lausanne, Switzerland, 13–15 June 2013; pp. 128–131. [Google Scholar] [CrossRef]
Werner, P.; Al-Hamadi, A.; Niese, R.; Walter, S.; Gruss, S.; Traue, H.C. Automatic Pain Recognition from Video and Biomedical Signals. In Proceedings of the 22nd International Conference on Pattern Recognition (ICPR), Stockholm, Sweden, 24–28 August 2014; pp. 4582–4587. [Google Scholar] [CrossRef]
Thiam, P.; Kestler, H.A.; Schwenker, F. Two-Stream Attention Network for Pain Recognition from Video Sequences. Sensors 2020, 20, 839. [Google Scholar] [CrossRef] [PubMed]
Lopez-Martinez, D.; Picard, R. Continuous Pain Intensity Estimation from Autonomic Signals with Recurrent Neural Networks. In Proceedings of the 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Honolulu, HI, USA, 18–21 July 2018; pp. 5624–5627. [Google Scholar] [CrossRef]
Thiam, P.; Bellmann, P.; Kestler, H.A.; Schwenker, F. Exploring Deep Physiological Models for Nociceptive Pain Recognition. Sensors 2019, 19, 4503. [Google Scholar] [CrossRef]
Thiam, P.; Kestler, H.A.; Schwenker, F. Multimodal Deep Denoising Convolutional Autoencoders for Pain Intensity Classification based on Physiological Signals. In Proceedings of the 9th International Conference on Pattern Recognition Applications and Methods (ICPRAM), Valletta, Malta, 22–24 February 2020; pp. 289–296. [Google Scholar] [CrossRef]
Gouverneur, P.; Li, F.; Adamczyk, W.M.; Szikszay, T.M.; Luedtke, K.; Grzegorzek, M. Comparison of Feature Extraction Methods for Physiological Signals for Heat-Based Pain Recognition. Sensors 2021, 21, 4838. [Google Scholar] [CrossRef]
Kächele, M.; Thiam, P.; Amirian, M.; Werner, P.; Walter, S.; Schwenker, F.; Palm, G. Multimodal Data Fusion for Person-Independent, Continuous Estimation of Pain Intensity. In Proceedings of the Engineering Applications of Neural Networks (EANN), Rhodes, Greece, 25–28 September 2015; pp. 275–285. [Google Scholar] [CrossRef]
Kächele, M.; Amirian, M.; Thiam, P.; Werner, P.; Walter, S.; Palm, G.; Schwenker, F. Adaptive confidence learning for the personalization of pain intensity estimation systems. Evol. Syst. 2017, 8, 71–83. [Google Scholar] [CrossRef]
Kächele, M.; Thiam, P.; Amirian, M.; Schwenker, F.; Palm, G. Methods for Person-Centered Continuous Pain Intensity Assessment From Bio-Physiological Channels. IEEE J. Sel. Top. Signal Process. 2016, 10, 854–864. [Google Scholar] [CrossRef]
Thiam, P.; Hihn, H.; Braun, D.A.; Kestler, H.A.; Schwenker, F. Multi-Modal Pain Intensity Assessment Based on Physiological Signals: A Deep Learning Perspective. Front. Physiol. 2021, 12, 720464. [Google Scholar] [CrossRef]
Lu, Z.; Ozek, B.; Kamarthi, S. Transformer encoder with multiscale deep learning for pain classification using physiological signals. Front. Physiol. 2023, 14, 1294577. [Google Scholar] [CrossRef] [PubMed]
Li, J.H.; Luo, J.C.; Wang, Y.S.; Jiang, Y.X.; Chen, X.; Quan, Y.J. Automatic Pain Assessment Based on Physiological Signals: Application of Multi-Scale Networks and Cross-Attention. In Proceedings of the 13th International Conference on Bioinformatics and Biomedical Science (ICBBS), Hong Kong, 18–20 October 2024; pp. 113–122. [Google Scholar] [CrossRef]
Li, G.; Yu, Y. Visual Saliency Detection Based on Multiscale Deep CNN Features. IEEE Trans. Image Process. 2016, 25, 5012–5024. [Google Scholar] [CrossRef] [PubMed]
Chen, W.; Shi, K. Multi-scale Attention Convolutional Neural Network for time series classification. Neural Netw. 2021, 136, 126–140. [Google Scholar] [CrossRef] [PubMed]
Zeiler, M.D.; Krishnan, D.; Taylor, G.W.; Fergus, R. Deconvolutional Networks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA, 13–18 June 2010; pp. 2528–2535. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 1–11. [Google Scholar]
Sabour, S.; Frosst, N.; Hinton, G.E. Dynamic Routing Between Capsules. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 3859–3869. [Google Scholar] [CrossRef]
Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014; pp. 1–14. [Google Scholar]
Rezende, D.J.; Mohamed, S.; Wierstra, D. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In Proceedings of the 31st International Conference on Machine Learning (ICML), Beijing, China, 21–26 June 2014; Volume 32, pp. 1278–1286. [Google Scholar]
Hinton, G.E.; Salakhutdinov, R.R. Reducing the Dimensionality of Data with Neural Networks. Science 2006, 313, 504–507. [Google Scholar] [CrossRef] [PubMed]
Hihn, H.; Gottwald, S.; Braun, D.A. Bounded Rational Decision-Making with Adaptive Neural Network Priors. In Proceedings of the Artificial Neural Networks in Pattern Recognition (ANNPR), Siena, Italy, 19–21 September 2018; pp. 213–225. [Google Scholar] [CrossRef]
Hihn, H.; Braun, D.A. Specialization in Hierarchical Learning Systems. Neural Process. Lett. 2020, 52, 2319–2352. [Google Scholar] [CrossRef]
Genewein, T.; Leibfried, F.; Grau-Moya, J.; Braun, D.A. Bounded Rationality, Abstraction, and Hierarchical Decision-Making: An Information-Theoretic Optimality Principle. Front. Robot. AI 2015, 2, 27. [Google Scholar] [CrossRef]
Hihn, H.; Braun, D.A. Hierarchical Expert Networks for Meta-Learning. In Proceedings of the 4th International Conference on Machine Learning (ICML) Workshop on Life Long Machine Learning, Vienna, Austria, 12–18 July 2020; pp. 1–11. [Google Scholar]
Ortega, P.A.; Braun, D.A. Thermodynamics as a theory of decision-making with information-processing costs. Proc. R. Soc. A Math. Phys. Eng. Sci. 2013, 469, 20120683. [Google Scholar] [CrossRef]
Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; Lerchner, A. β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017; pp. 1–22. [Google Scholar]
Steur, N.A.K.; Schwenker, F. A Step Towards Neuroplasticity: Capsule Networks with Self-Building Skip Connections. AI 2025, 6, 1. [Google Scholar] [CrossRef]
Seung, H.S. Learning Continuous Attractors in Recurrent Networks. Adv. Neural Inf. Process. Syst. 1997, 10, 654–660. [Google Scholar]
Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P.A. Extracting and Composing Robust Features with Denoising Autoencoders. In Proceedings of the 25th International Conference on Machine Learning (ICML), Helsinki, Finland, 5–9 July 2008; pp. 1096–1103. [Google Scholar] [CrossRef]
Im, D.J.; Ahn, S.; Memisevic, R.; Bengio, Y. Denoising Criterion for Variational Auto-Encoding Framework. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31, pp. 2059–2065. [Google Scholar] [CrossRef]
Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
Makhzani, A.; Shlens, J.; Jaitly, N.; Goodfellow, I.; Frey, B. Adversarial Autoencoders. In Proceedings of the 4th International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016; pp. 1–16. [Google Scholar] [CrossRef]
Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. Adv. Neural Inf. Process. Syst. 2014, 27, 1–9. [Google Scholar]
Creswell, A.; Bharath, A.A. Denoising Adversarial Autoencoders. IEEE Trans. Neural Networks Learn. Syst. 2019, 30, 968–984. [Google Scholar] [CrossRef] [PubMed]
LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
Deng, L. The MNIST Database of Handwritten Digit Images for Machine Learning Research. IEEE Signal Process. Mag. 2012, 29, 141–142. [Google Scholar] [CrossRef]
Hinton, G.; Sabour, S.; Frosst, N. Matrix Capsules with EM Routing. In Proceedings of the 6th International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–15. [Google Scholar]
Hinton, G.E.; Krizhevsky, A.; Wang, S.D. Transforming Auto-Encoders. In Proceedings of the 21st International Conference on Artificial Neural Networks (ICANN), Espoo, Finland, 14–17 June 2011; pp. 44–51. [Google Scholar] [CrossRef]
Steur, N.A.K.; Schwenker, F. Next-Generation Neural Networks: Capsule Networks With Routing-by-Agreement for Text Classification. IEEE Access 2021, 9, 125269–125299. [Google Scholar] [CrossRef]
Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised Contrastive Learning. Adv. Neural Inf. Process. Syst. 2020, 33, 1–13. [Google Scholar]
Barbano, C.A.; Dufumier, B.; Tartaglione, E.; Grangetto, M.; Gori, P. Unbiased Supervised Contrastive Learning. In Proceedings of the Eleventh International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023; pp. 1–24. [Google Scholar]
Poole, B.; Ozair, S.; van den Oord, A.; Alemi, A.A.; Tucker, G. On Variational Bounds of Mutual Information. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 10–15 June 2019; Volume 97, pp. 5171–5180. [Google Scholar]
Böhm, V.; Seljak, U. Probabilistic Autoencoder. Trans. Mach. Learn. Res. 2022, 1–25. [Google Scholar] [CrossRef]
Rezende, D.J.; Mohamed, S. Variational Inference with Normalizing Flows. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 1–10. [Google Scholar]
Dinh, L.; Sohl-Dickstein, J.; Bengio, S. Density Estimation Using Real NVP. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017; pp. 1–32. [Google Scholar]
Steur, N.A.K.; Schwenker, F. Class-Variational Learning With Capsule Networks for Deep Entity-Subspace Clustering. IEEE Access 2023, 11, 117368–117384. [Google Scholar] [CrossRef]
Rezaabad, A.L.; Vishwanath, S. Learning Representations by Maximizing Mutual Information in Variational Autoencoders. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 21–26 June 2020; pp. 2729–2734. [Google Scholar] [CrossRef]
Chapelle, O.; Weston, J.; Bottou, L.; Vapnik, V. Vicinal Risk Minimization. Adv. Neural Inf. Process. Syst. 2000, 13, 1–7. [Google Scholar]
Tian, Y.; Krishnan, D.; Isola, P. Contrastive Multiview Coding. In Proceedings of the 16th European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 776–794. [Google Scholar] [CrossRef]
Liu, Y.; Fan, Q.; Zhang, S.; Dong, H.; Funkhouser, T.; Yi, L. Contrastive Multimodal Fusion with TupleInfoNCE. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 734–743. [Google Scholar] [CrossRef]
Sermanet, P.; Lynch, C.; Chebotar, Y.; Hsu, J.; Jang, E.; Schaal, S.; Levine, S. Time-Contrastive Networks: Self-Supervised Learning from Video. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 1134–1141. [Google Scholar] [CrossRef]
Xie, S.; Gu, J.; Guo, D.; Qi, C.R.; Guibas, L.; Litany, O. PointContrast: Unsupervised Pre-training for 3D Point Cloud Understanding. In Proceedings of the 16th European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 574–591. [Google Scholar] [CrossRef]
Yang, X.; Zhang, Z.; Cui, R. TimeCLR: A self-supervised contrastive learning framework for univariate time series representation. Knowl.-Based Syst. 2022, 245, 108606. [Google Scholar] [CrossRef]
Yu, J.; Gao, X.; Zhai, F.; Li, B.; Xue, B.; Fu, S.; Chen, L.; Meng, Z. An adversarial contrastive autoencoder for robust multivariate time series anomaly detection. Expert Syst. Appl. 2024, 245, 123010. [Google Scholar] [CrossRef]
Mai, S.; Zeng, Y.; Hu, H. Learning from the global view: Supervised contrastive learning of multimodal representation. Inf. Fusion 2023, 100, 101920. [Google Scholar] [CrossRef]
Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450v1. [Google Scholar] [CrossRef]
Hendrycks, D.; Gimpel, K. Gaussian Error Linear Units (GELUs). arXiv 2016, arXiv:1606.08415v5. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar] [CrossRef]
Python Programming Language. Python Software Foundation: Beaverton, OR, USA. Available online: https://www.python.org/ (accessed on 18 September 2025).
Chollet, F. Keras. 2015. Available online: https://keras.io (accessed on 18 September 2025).
Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv 2015, arXiv:1603.04467v2. [Google Scholar]
Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the Seventh International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019; pp. 1–11. [Google Scholar]
Kingma, D.P.; Lei Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015; pp. 1–15. [Google Scholar] [CrossRef]
Reddi, S.J.; Kale, S.; Kumar, S. On the Convergence of Adam and Beyond. In Proceedings of the Sixth International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–9. [Google Scholar]
Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar] [CrossRef]
Fragoso, T.M.; Bertoli, W.; Louzada, F. Bayesian Model Averaging: A Systematic Review and Conceptual Classification. Int. Stat. Rev. 2018, 86, 1–28. [Google Scholar] [CrossRef]
Hoeting, J.A.; Madigan, D.; Raftery, A.E.; Volinsky, C.T. Bayesian Model Averaging: A Tutorial. Stat. Sci. 1999, 14, 382–417. [Google Scholar] [CrossRef]
TensorFlow Datasets: A Collection of Ready-to-Use Datasets. Available online: https://www.tensorflow.org/datasets (accessed on 12 March 2025).

Figure 1. The general architecture of the MM-SCAAE pretraining model for fusing the information from multi-sensor input data into a global object representation. A Denoising Variational AutoEncoder (DVAE) per input modality acts as a core component (center), which fosters the learning of an expressive latent space. The latent representation spaces are regularized by separated adversarial networks and a common fusion network. The adversarial regularization (left) aligns the output space of each encoder with a shared prior distribution type but individually adapts the distribution parameters. The late fusion network (right) aggregates the unimodal encodings into a global representation space with class partitions induced by the supervised contrastive objective. The global latent space is also regularized by an adaptive prior distribution. The processing course of the three displayed input modalities is highlighted through the individual colors of blue, yellow and (light) red.

Figure 2. Implementation of MM-SCAAE’s model components for the task of multi-sensor pain classification. (a) Variational encoder component of each unimodal autoencoder. The encoder extracts features of varying granularity with window size

w \in {0.5, 1, 2}

in seconds using a Multi-Scale Convolutional Neural Network (MSCNN) and produces parameter estimates for a local Gaussian distribution. (b) Denoising decoder component of each unimodal autoencoder. The decoder receives a stochastically sampled latent representation drawn from the local Gaussian distribution predicted by the encoder, and reconstructs the uncorrupt input. (c) The late fusion network aggregates the multimodal information into a global object representation via two sequential self-attention transformer layers. Within each subfigure, identically colored network layers share the same layer type.

Figure 3. Classification performance evaluation per pain intensity (

T_{1}

,

T_{2}

,

T_{3}

,

T_{4}

) versus the no-pain baseline (

T_{0}

). (a) ROC curve and AUC values document the change in performance for various classification thresholds. (b) Violin plot for each LOSO accuracy distribution.

Figure 4. Growth in two-class (

T_{0}

vs.

T_{4}

) LOSO accuracy over grouped predictions per patient, with an increasing number of samples per group. Each box plot visualizes the LOSO accuracy distribution for the respective group size, which is approximated through MC sampling of random groups over 100 iterations. The accuracy gain potential (gray-shaded area) signifies the performance discrepancy between deterministic single-sample classification (red-dotted line) and pain class prediction grouped over all test instances (blue-dotted line).

Table 1. Classification metrics with standard deviations for the LOSO cross validation using two-class pretraining. Except for Cohen’s

κ

, all metrics are stated in %. The best value per classification metric is stated in bold.

Table 1. Classification metrics with standard deviations for the LOSO cross validation using two-class pretraining. Except for Cohen’s

κ

, all metrics are stated in %. The best value per classification metric is stated in bold.

Task	Sensitivity/Recall	Precision	Specificity	Cohen’s $κ$	$F_{1}$ -Score	Accuracy
$T_{0}$ vs. $T_{1}$	$60.64 (13.20)$	$44.83 (23.71)$	$58.09 (11.44)$	$0.15 (0.19)$	$54.99 (10.85)$	$57.61 (9.27)$
$T_{0}$ vs. $T_{2}$	$67.24 (19.44)$	$51.21 (29.60)$	$65.07 (15.75)$	$0.29 (0.29)$	$62.06 (16.93)$	$64.74 (14.36)$
$T_{0}$ vs. $T_{3}$	$75.75 (14.44)$	$71.26 (24.56)$	$75.98 (16.47)$	$0.49 (0.28)$	$73.50 (15.17)$	$74.40 (14.10)$
$T_{0}$ vs. $T_{4}$	$85.74 (11.13)$	$81.15 (22.47)$	$84.96 (15.52)$	$0.68 (0.26)$	$83.72 (14.07)$	$84.22 (13.12)$

Table 2. Classification accuracy (in %) with standard deviation within the LOSO cross validation for varying sensor availability. Sensor values of 0 or 1 signify sensor state of deactivation or activation, respectively. The best accuracy value per classification task is stated in bold.

Sensors			Tasks
EDA	ECG	EMG	$T_{0} vs . T_{1}$	$T_{0} vs . T_{2}$	$T_{0} vs . T_{3}$	$T_{0} vs . T_{4}$
0	0	1	$49.74 (2.65)$	$49.94 (3.47)$	$50.06 (6.65)$	$53.99 (8.34)$
0	1	0	$50.40 (1.81)$	$50.26 (2.40)$	$51.81 (5.78)$	$56.70 (10.89)$
0	1	1	$49.37 (4.57)$	$49.97 (3.42)$	$52.53 (7.19)$	$59.86 (11.85)$
1	0	0	$57.07 (9.66)$	$63.65 (14.52)$	$74.45 (13.53)$	$83.56 (13.55)$
1	0	1	$57.53 (9.97)$	$64.40 (14.56)$	$74.17 (13.79)$	$83.30 (13.89)$
1	1	0	$57.70 (9.94)$	$64.08 (14.25)$	$74.94 (13.57)$	$83.42 (14.14)$
1	1	1	$57.61 (9.27)$	$64.74 (14.36)$	$74.40 (14.10)$	$84.22 (13.12)$

Table 3. Classification accuracy (in %) with standard deviation within the LOSO cross validation using distinct multi-class pretraining scenarios. The best accuracy value per classification task is stated in bold.

Multi-Class Pretraining	SupCon Loss	Tasks
Multi-Class Pretraining	SupCon Loss	$T_{0}$ vs. $T_{1}$	$T_{0}$ vs. $T_{2}$	$T_{0}$ vs. $T_{3}$	$T_{0}$ vs. $T_{4}$
{ $T_{0}$ , $T_{2}$ , $T_{4}$ }	nominal	-	$65.34 (14.37)$	-	$83.39 (13.44)$
{ $T_{0}$ , $T_{2}$ , $T_{4}$ }	ordinal	-	$64.66 (14.79)$	-	$83.22 (13.43)$
{ $T_{0}$ , $T_{1}$ , $T_{2}$ , $T_{3}$ , $T_{4}$ }	nominal	$57.24 (10.57)$	$63.59 (13.83)$	$72.13 (13.88)$	$80.72 (13.40)$
{ $T_{0}$ , $T_{1}$ , $T_{2}$ , $T_{3}$ , $T_{4}$ }	ordinal	$57.39 (10.13)$	$62.39 (13.34)$	$70.26 (12.28)$	$79.89 (13.49)$

Table 4. LOSO cross validation performance comparison of pain recognition models on the BioVid heat pain database using classification accuracy and

F_{1}

-Score (in %). Values are stated in the format {Accuracy/

F_{1}

-Score}. The best values are individually emphasized in bold for unimodal and multimodal models per classification task. Additionally, the best values per section for the model group with diverging segmentation scheme (*) are underlined. The (✓) symbol signifies the use of a certain modality for the task of pain recognition.

Table 4. LOSO cross validation performance comparison of pain recognition models on the BioVid heat pain database using classification accuracy and

F_{1}

-Score (in %). Values are stated in the format {Accuracy/

F_{1}

-Score}. The best values are individually emphasized in bold for unimodal and multimodal models per classification task. Additionally, the best values per section for the model group with diverging segmentation scheme (*) are underlined. The (✓) symbol signifies the use of a certain modality for the task of pain recognition.

Method		Modalities				Tasks
Method		EDA	ECG	EMG	Video	$T_{0} vs . T_{1}$	$T_{0} vs . T_{2}$	$T_{0} vs . T_{3}$	$T_{0} vs . T_{4}$
Unimodal	HCFs + Random Forest [3]				✓	$51.20 /$ -	$54.40 /$ -	$62.90 /$ -	$71.60 /$ -
	HCFs + Logistic Regression * [5]	✓				$54.57 /$ -	$59.40 /$ -	$66.00 /$ -	$74.21 /$ -
	CNN * [6]	✓				$\underset{̲}{61.67} /$ -	$\underset{̲}{66.93} /$ -	$\underset{̲}{76.38} /$ -	$\underset{̲}{84.57} /$ -
	CNN * [6]		✓			$49.71 /$ -	$50.72 /$ -	$52.87 /$ -	$57.04 /$ -
	CNN * [6]			✓		$49.71 /$ -	$50.29 /$ -	$53.25 /$ -	$58.65 /$ -
	Attention Networks [4]				✓	-/-	-/-	-/-	$69.25 /$ -
	MLP + Random Forest [8]	✓				$59.08 /$ -	$65.09 /$ -	$75.14 /$ -	$84.22 /$ -
	Transformer Encoder [13]	✓				$56.70 / 55.97$	$68.78 / 68.79$	$77.41 / 77.37$	$85.56 / 85.49$
	CA Transformer Encoder [14]	✓				$63.99 / 63.89$	$70.05 / 70.03$	$78.24 / 77.23$	$86.21 / 85.85$
Multimodal	HCFs + Random Forest [3]	✓	✓	✓		$54.90 /$ -	$59.30 /$ -	$65.00 /$ -	$74.10 /$ -
	HCFs + Random Forest [3]	✓	✓	✓	✓	$54.60 /$ -	$60.00 /$ -	$67.70 /$ -	$77.80 /$ -
	HCFs + Random Forest [9,10]	✓	✓	✓	✓	-/-	-/-	-/-	$83.10 /$ -
	HCFs + Random Forest [11]	✓	✓	✓		-/-	-/-	-/-	$82.73 /$ -
	HCFs + Linear SVM * [5]	✓	✓			-/-	-/-	-/-	$72.20 /$ -
	CNN * [6]	✓	✓	✓		$\underset{̲}{61.15} /$ -	$\underset{̲}{66.81} /$ -	$\underset{̲}{76.29} /$ -	$\underset{̲}{84.40} /$ -
	DAE * [7]	✓	✓	✓		-/-	-/-	-/-	$83.99 /$ -
	Adaptive DVAE * [12]	✓	✓	✓		-/-	-/-	-/-	$84.25 / \underset{̲}{81.48}$
	MM-SCAAE (ours)	✓	✓	✓		$57.61 / 54.99$	$64.74 / 62.06$	$74.40 / 73.50$	$84.22 / 83.72$

Table 5. LOSO cross validation performance comparison of pain recognition models on the BioVid heat pain database using the sensitivity and specificity classification metrics (in %). Values are stated in the format {Sensitivity/Specificity}. The best metric value per classification task is stated in bold.

Method	Tasks
Method	$T_{0} vs . T_{1}$	$T_{0} vs . T_{2}$	$T_{0} vs . T_{3}$	$T_{0} vs . T_{4}$
Transformer Encoder [13]	$69.54 / 43.85$	$70.48 / 67.10$	$73.16 / 81.67$	$82.70 / 88.28$
CA Transformer Encoder [14]	$64.17 /$ -	$70.05 /$ -	$78.27 /$ -	$85.89 /$ -
MM-SCAAE (ours)	$60.64 / 58.09$	$67.24 / 65.07$	$75.75 / 75.98$	$85.74 / 84.96$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Multimodal Pain Recognition Based on Contrastive Adversarial Autoencoder Pretraining

Abstract

1. Introduction

2. Methods

2.1. Model Overview

2.2. Variational Autoencoder

2.2.1. Bernoulli Denoising Criterion

2.2.2. Adversarial Regularizer

2.3. Supervised Contrastive Objective

2.3.1. Similarity Measure

2.3.2. ϵ –SupInfoNCE Loss

2.3.3. Ordinal ϵ –SupInfoNCE Loss

2.4. Mutual Information Constraints

3. Related Work

4. Results

4.1. BioVid Dataset

4.2. Pain Recognition Task

4.3. Data Preprocessing and Augmentation

4.3.1. Instance-Based Normalization

4.3.2. Random Crop

4.3.3. Modality Replacement

4.3.4. Modality Dropout

4.4. Network Components

4.5. Global Setup

4.6. Experiments

4.6.1. Classification Performance

4.6.2. Modality Significance

4.6.3. Ordinal-Aware Multi-Class Pretraining

4.6.4. Performance Comparison

4.6.5. Grouped-Prediction Estimate

5. Discussion

5.1. Perspectives on Pain Assessment

5.2. Generic Multi-Sensor Pretraining for Time-Series Classification

5.3. Supervised Contrastive Adversarial AutoEncoder (SCAAE)

5.4. Limitations of This Study and Future Work

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Appendix A. Variational Lower Bound and Denoising Criterion

Appendix B. Confidence-Aware Similarity Measure

Appendix C. Ordinality-Aligned ϵ–SupInfoNCE Derivation

Appendix D. Robust Supervised Adversarial AutoEncoder (SAAE)

Appendix D.1. Experimental Setup

Appendix D.2. Nominal ϵ–SupInfoNCE Loss

Appendix D.3. Ordinal ϵ–SupInfoNCE Loss

Appendix E. Summary of the Global Setup

Appendix F. Bayesian Model Averaging via Prediction Grouping

References

Article Metrics

Citations

Article Access Statistics

2.3.2. $ϵ$ –SupInfoNCE Loss

2.3.3. Ordinal $ϵ$ –SupInfoNCE Loss