Randomized Modality Mixing with Patchwise RBF Networks for Robust Multimodal Pain Recognition

Erdal, Mehmet; Gruss, Sascha; Walter, Steffen; Schwenker, Friedhelm

doi:10.3390/computers15020127


¹ Institute of Neural Information Processing, Ulm University, James-Franck-Ring, 89081 Ulm, Germany

² Medical Psychology Group, University Clinic, 89075 Ulm, Germany

* Authors to whom correspondence should be addressed.

† This article is a revised and expanded version of a paper entitled “ModMix: Data Augmentation for Multimodal Pain Detection”. In Proceedings of the ICPR 2024 International Workshops and Challenges, Kolkata, India, 1–5 December 2024.

Computers 2026, 15(2), 127; https://doi.org/10.3390/computers15020127

Submission received: 14 November 2025 / Revised: 3 February 2026 / Accepted: 11 February 2026 / Published: 14 February 2026

(This article belongs to the Section Human–Computer Interactions)


Abstract

Pain recognition based on multimodal physiological signals remains a challenge, not only because of the limited training data, but also due to the varying responses of individuals. In this article, we present a randomized modality mixing technique (Modmix) for multimodal data augmentation and a patchwise radial basis function (RBF) network designed to improve robustness in limited and highly heterogeneous data. Modmix generates new samples by randomly swapping modalities between existing data points, creating new data in a very simple but effective way. The RBF patch network divides the input into randomly selected, overlapping patches that capture local similarities between modalities. Each patch network is trained end-to-end using stochastic gradient descent. Moreover, the model’s performance is further improved by using multiple independently trained networks and combining them into a single decision. Experiments with the two different pain datasets X-ITE and BioVid were performed under limited training data conditions, where only approximately 30% of the original datasets were used for training. With both datasets the RBF patch network achieved significant improvements for a subset of subjects, resulting in a similar or even slightly better mean accuracy compared to competing related models such as random forest and support vector machine.

Keywords:

data augmentation; mixup; modmix; radial basis function networks

1. Introduction

Automatic pain recognition is required in situations where patients cannot communicate their pain verbally in the usual way. As in many other fields, machine learning is delivering promising results in pain recognition. In and of itself, machine learning can be viewed as a fusion of statistical inference and function approximation. Both bring their own challenges, summarized in the so-called bias–variance trade-off. Here, variance measures how well a model generalizes from finite data, while bias measures how well it can represent the target function. In the area of deep learning, where models can represent arbitrary functions, bias plays a rather minor role, and the main focus is on variance, which usually arises from insufficient data, noise, or unrepresentative distributions. Interestingly, areas such as pain recognition suffer primarily from a lack of data, as the corresponding handcrafted datasets were designed by experts who know which signals to measure, how to measure them, and, above all, where to look for the required features in the signal. Such high-quality datasets are therefore extremely expensive to collect, and collecting them may even be ethically questionable.

For this reason, among others, researchers have started to use data augmentation techniques to generate additional samples from existing ones [1]. However, while standard techniques such as geometric transformations or noise addition offer limited novelty, mixup augmentation [2] produces new samples through convex combinations of data samples and labels. In [3], we propose a mixup variant to address multimodal pain detection where feature vectors are composed of self-contained modalities originating from different signal sources. The relatively high degree of independence between the modalities makes it possible to generate new data by simply exchanging modalities between existing data samples. The introduction of randomness to improve generalization is a well-known and widely used approach in machine learning, for example, in dropout regularization, stochastic gradient descent learning, and especially in random forests where randomization is applied at two levels: at the dataset level by sampling subsets for each decision tree and at the feature level by using only a sample of features for each tree. By averaging the predictions of these simple trees with consequently high variance, the ensemble achieves robust variance reduction.

Likewise, we introduce a radial basis function (RBF) network that operates on localized random patches of the feature vector. Each patch represents a predefined, randomly selected multi-set of features and provides a localized view of the high-dimensional data to capture different structures distributed across different modalities. We then combine multiple different RBF patch networks and take the average output as the final decision, with each network trained on somewhat different data augmented with Modmix. Unlike conventional neurons, RBF neurons measure similarity based on the distance between points and simultaneously represent an inner product in a higher-dimensional space in which linear decision boundaries can be found. This trick is also used in support vector machines (SVMs). However, SVMs are far less flexible and cannot be adapted to a patch-based approach as easily as neural networks. In addition, they often require a large number of support vectors to represent the decision functions. Therefore, our RBF patch network combines the compactness and flexibility of neural networks with the advantages of kernel-based optimization.

The rest of this article is structured as follows: Section 2 provides a concise overview of previous research on pain recognition and its machine learning approaches. We then present our proposed RBF patch network and Modmix data augmentation strategy in detail in Section 3. The multimodal pain datasets X-ITE and BioVid, which we use in our experiments, are introduced in Section 4, and the experimental setup is described in Section 5. The results and discussion of the experiments are presented in Section 6. We then conclude this article in Section 7 with a summary of our findings and possible future research directions.

2. Related Work

SVMs with radial basis function (RBF) kernels and random forests have already proven to be powerful models in the field of pain recognition based on facial expressions, electromyography, or physiological signals. Both have been widely used on datasets such as UNBC-McMaster [4], where they performed exceptionally well. In multimodal settings in which heterogeneous features such as facial action units and skin conductance are combined, random forests, in particular, have been shown to be effective in maintaining robustness to noise and small sample sizes. These results underscore the importance of localized similarity measures and ensemble averaging for reliable pain detection.

Deep learning and its powerful models have opened up entirely new possibilities for recognizing useful patterns in facial, physiological, and sometimes even brain activity signals [5,6,7]. Recently, radial basis function networks have regained popularity due to their ability to recognize local patterns in input data, which also makes them easier to interpret. These properties can be especially helpful in pain detection, as various studies have already demonstrated [8,9]. The closest study to our work is [10], where RBF networks are used in convolutional layers with patch-like local connections. Just like conventional neural networks, RBF networks are universal approximators. Even so, universal approximation theorems establish expressiveness but say nothing about how much data is needed.

Recent transformer-based architectures have shown promising results in capturing temporal dependencies in electrodermal activity (EDA) and electrocardiogram (ECG) signals [11,12,13]. In [14] a cross-modality framework was introduced, which achieved state-of-the-art performance (87.52% accuracy on the BioVid dataset) by employing a two-stage fusion strategy: first extracting intramodal features using transformer blocks for each signal independently and then fusing EDA and ECG representations through a dedicated cross-modal transformer.

On the other hand, availability of sufficient data is a common challenge in machine learning, especially in deep learning. Data augmentation has thus become a crucial part of developing successful machine learning models. Nonetheless, generating representative synthetic data is a highly complex problem in itself, but fortunately, mixup is a very simple method that delivers extremely good results. Over time, other variants of mixup have emerged, including CutMix [15], which is closest to our work. CutMix combines images by cutting out an area from one image and pasting it into another. The corresponding labels are mixed proportionally to the area of the pasted region. CutMix encourages the model to focus on critical parts of the image when making predictions, improving generalization and robustness, especially for tasks such as object detection.

3. Materials and Methods

3.1. RBF Patch Network

The classic architecture of an RBF network corresponds to a feedforward neural network with only one hidden layer and Gaussians (radial basis functions) as activation. The fundamental difference from conventional neural networks is that the Euclidean distance $\lVert x - c \rVert$ from the feature vector $x$ to the weight vector $c$ is used as input for the activation $\Phi$ instead of the dot product $c^{T} x$:

$$\Phi(x) = \exp\left(-\frac{\lVert x - c \rVert^{2}}{2\sigma^{2}}\right). \tag{1}$$

The vector $c$ can be considered as a center in the input space, where $\lVert x - c \rVert$ defines a spherical activation area of the neuron in which the center $c$ provides maximum activation. The width parameter $\sigma$ determines how sensitive the neuron is to distances from its center $c$ and is usually derived together with $c$ from the data in a pre-training phase using a clustering algorithm or as part of a global learning algorithm such as gradient descent.

Our proposed network follows this classical structure as shown in Figure 1, but instead of connecting each hidden neuron to the entire input vector, we divide the input into randomly sampled, smaller, possibly overlapping multi-sets, which we will call patches. Each random patch can contain features of each modality and is assigned to a fixed neuron. The advantage of this approach is twofold: the dimensionality reduction preserves the meaningfulness of local distances while maintaining activation diversity, and the randomly selected features enable the model to detect localized similarities across modalities. The outputs of all these local detectors are then linearly combined to make the final prediction

$$f(x) = w_{0} + \sum_{j=1}^{m} w_{j} \Phi_{j}(x_{P_{j}}), \tag{2}$$

where $m$ is the number of patches, $P_{j}$ is the $j$-th patch, and $x_{P_{j}}$ denotes the entries of $x$ selected by $P_{j}$.
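The forward pass in Equation (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rbf_patch_forward`, the list-based data layout, and the toy values in the usage lines are hypothetical, and the sketch uses the Euclidean activation of Equation (1) rather than the trained model from the experiments.

```python
import math


def rbf_patch_forward(x, patches, centers, sigmas, weights, bias):
    """Evaluate f(x) = w_0 + sum_j w_j * Phi_j(x restricted to patch j).

    patches : list of index lists (one multi-set of feature indices per neuron)
    centers : list of center vectors, centers[j] has the length of patches[j]
    sigmas  : list of width parameters, one per neuron
    weights : list of output weights w_j, one per neuron
    bias    : output bias w_0
    """
    output = bias
    for patch, center, sigma, w in zip(patches, centers, sigmas, weights):
        x_patch = [x[i] for i in patch]  # local view of the input
        sq_dist = sum((a - b) ** 2 for a, b in zip(x_patch, center))
        output += w * math.exp(-sq_dist / (2.0 * sigma ** 2))  # Gaussian unit
    return output


# Toy usage: two neurons, each looking at two of the four features.
x = [0.0, 1.0, 2.0, 3.0]
prediction = rbf_patch_forward(
    x,
    patches=[[0, 1], [2, 3]],
    centers=[[0.0, 1.0], [0.0, 0.0]],
    sigmas=[1.0, 1.0],
    weights=[2.0, 4.0],
    bias=0.5,
)
```

In the toy call, the first neuron sits exactly on its center and therefore contributes its full weight, while the second neuron is far from its center and contributes almost nothing.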

3.1.1. Mahalanobis Distance

Apart from losing its meaning as a distance measure in high-dimensional spaces, the standard Euclidean distance has another limiting characteristic: it does not take into account correlations or differences in scale between features. This means that features with larger variances can distort the distance in much the same way that outliers can distort the mean. For this reason we use the Mahalanobis distance $D_{M}$,

$$D_{M}(x, c) = \sqrt{(x - c)^{T} \Sigma^{-1} (x - c)}, \tag{3}$$

where the data is first transformed into a space where all features are uncorrelated. In this way points that form an elliptical cluster will be transformed into a space where their cluster is spherical, so that the Euclidean distance will not be distorted.

In the case where the features are uncorrelated, the covariance matrix $\Sigma$ is diagonal and the cluster is an ellipsoid aligned with the coordinate axes. Then $D_{M}$ is just a variance-normalized version of the Euclidean distance. In the general case, the term under the root in (3) is a quadratic form whose level sets are $n$-dimensional ellipsoids. Since $\Sigma$ is symmetric, it has an eigenvalue decomposition $\Sigma = P D P^{T}$ with orthogonal matrix $P$ and diagonal matrix $D$, so that its inverse $\Sigma^{-1}$ can be represented by

$$\Sigma^{-1} = (P D P^{T})^{-1} = P D^{-1} P^{T}. \tag{4}$$

Putting (4) into (3) while using $D^{-1} = D^{-\frac{1}{2}} D^{-\frac{1}{2}}$ and $S = D^{-\frac{1}{2}} P^{T}$ leads to

$$\begin{aligned} D_{M}(x, c) &= \sqrt{\left(D^{-\frac{1}{2}} P^{T} (x - c)\right)^{T} \left(D^{-\frac{1}{2}} P^{T} (x - c)\right)} \\ &= \sqrt{(S (x - c))^{T} (S (x - c))} \\ &= \lVert S x - S c \rVert. \end{aligned} \tag{5}$$

In (5) we see that the Mahalanobis distance is the Euclidean distance applied to the transformed points. Using $D_{M}$ in place of the scaled Euclidean distance $\lVert x - c \rVert / \sigma$ in (1), we get for the $j$-th neuron with center $c_{j}$ and covariance matrix $\Sigma_{j}$

$$\Phi_{j}(x) = \exp\left(-\frac{1}{2} (x - c_{j})^{T} \Sigma_{j}^{-1} (x - c_{j})\right). \tag{6}$$
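The equivalence derived in Equations (3) to (5) can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation; the function names and the example covariance matrix are hypothetical and chosen only for illustration.

```python
import numpy as np


def whitening_matrix(cov):
    """Return S = D^{-1/2} P^T from the eigendecomposition cov = P D P^T."""
    eigvals, eigvecs = np.linalg.eigh(cov)  # cov: symmetric positive definite
    return np.diag(eigvals ** -0.5) @ eigvecs.T


def mahalanobis(x, c, cov):
    """Direct evaluation of sqrt((x - c)^T cov^{-1} (x - c)), Equation (3)."""
    diff = x - c
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))


def mahalanobis_whitened(x, c, cov):
    """Same distance computed as the Euclidean norm ||Sx - Sc||, Equation (5)."""
    S = whitening_matrix(cov)
    return float(np.linalg.norm(S @ x - S @ c))


def rbf_activation(x, c, cov):
    """Gaussian unit exp(-0.5 * D_M(x, c)^2), Equation (6)."""
    return float(np.exp(-0.5 * mahalanobis(x, c, cov) ** 2))


# Illustrative values: two correlated features with different variances.
example_cov = np.array([[2.0, 0.6], [0.6, 1.0]])
example_x = np.array([1.0, -2.0])
example_c = np.array([0.5, 0.5])
direct = mahalanobis(example_x, example_c, example_cov)
whitened = mahalanobis_whitened(example_x, example_c, example_cov)
```

Both routes give the same distance up to floating-point error, and the activation equals 1 when the input coincides with the center.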

3.1.2. Patch Sampling

Patch sampling plays a crucial role in the architecture of the RBF patch network, where the number m of patches determines the number of neurons processing inputs of size k. The elements of the patches are indices of the feature vector with length n, which are sampled uniformly with replacement.

Some features may thus occur in more than one patch, which on the one hand leads to slight correlations between the neurons and on the other hand offers the neurons different perspectives on the same feature. Some features might not occur in any patch, which is necessarily the case for $m k < n$. This could be considered a form of regularization. It is also possible that some features occur several times in the same patch, giving them more weight in the distance calculation. For a given patch the probability of this happening is

$$P(\text{at least 2 same features}) = 1 - \frac{n (n - 1) \cdots (n - k + 1)}{n^{k}}. \tag{7}$$

To avoid excessive feature redundancy within patches one should keep k relatively small.
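Patch sampling and Equation (7) can be sketched as follows. The helper names are hypothetical. The example values n = 412, m = 160, and k = 16 are taken from the X-ITE feature counts in Section 4 and the patch configuration in Section 5.3, on the assumption that the five modalities are simply concatenated.

```python
import random


def sample_patches(n, m, k, rng):
    """Draw m patches, each a list of k feature indices sampled uniformly
    with replacement from range(n)."""
    return [[rng.randrange(n) for _ in range(k)] for _ in range(m)]


def prob_duplicate_in_patch(n, k):
    """Probability from Equation (7) that a patch of size k contains at least
    one repeated feature when sampling from n features with replacement."""
    p_all_distinct = 1.0
    for i in range(k):
        p_all_distinct *= (n - i) / n
    return 1.0 - p_all_distinct


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
patches = sample_patches(n=412, m=160, k=16, rng=rng)
p_dup = prob_duplicate_in_patch(n=412, k=16)
```

For these values, roughly one patch in four is expected to contain a repeated feature, which illustrates why small k keeps within-patch redundancy manageable.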

3.1.3. Relation to RBF-SVMs

Another well-established and widely used model that uses RBF functions is the support vector machine (SVM) with RBF kernels. The RBF-SVM and the RBF network have fundamentally different theoretical backgrounds; however, they represent decision functions of nearly identical algebraic form. For the RBF-SVM, the decision function is given by

$$f(x) = w_{0} + \sum_{i \in S} w_{i} y_{i} \Phi(x, x_{i}), \tag{8}$$

where $S$ indexes the support vectors $x_{i}$, $y_{i}$ are their corresponding class labels, and $\Phi$ uses (1) as a kernel, with the support vector $x_{i}$ taking the role of the center $c$. Each model combines nonlinear basis functions linearly to create a linear classifier in a higher-dimensional space, where these basis functions represent corresponding inner products of the inputs.

However, the approach to determining the decision function differs fundamentally, as does the way in which the two models separate the classes in the input space. The SVM attempts to maximize the margin between the classes by choosing as centers the training points that lie closest to the decision boundary on each side. These points are called support vectors, and their number can be as large as the dataset in the worst case. In addition, the same width parameter is used in all kernels. The RBF network is much more flexible; its architecture is essentially user-defined, with the number of neurons being a design decision and the centers ultimately being located anywhere in the input space. Furthermore, the width parameters are also typically learned and can differ between neurons. The RBF network is easy to customize and extend, as in this work. This flexibility often results in more compact models that perform similarly to or even better than SVMs.

3.1.4. Relation to Random Forests

In this work, we combine several RBF patch networks with each other and take the average of the individual outputs as a decision. This approach has some similarities with other ensemble models, in particular with random forests. A random forest combines multiple decision trees, each trained on a randomly selected subset of the training data. Furthermore, the individual decision trees are created using only a randomly selected subset of features. This two-stage randomization across data and feature dimensions leads to diversity that effectively reduces model variance and improves generalization.

Similarly, the RBF patch network is randomized at the feature level, with randomly selected patches assigned to individual neurons. Each patch is a multi-set supported by a subset of the input features. Technically the RBF patch network creates a weighted average of single outputs based on subsets of the features but has no randomization at the data level like the random forest does. Randomization at the data level is then introduced by training the individual RBF patch networks on a slightly modified version of the dataset. This difference results from the addition of a certain percentage of augmented data to each network. The artificial data is generated using Modmix, which can be considered a kind of third level of randomization, as new data is essentially generated by the random combination of modalities.

3.2. Data Augmentation with Modmix

Randomized modality mixing (Modmix) is a special form of the mixup data augmentation method. To appreciate the idea of mixup we will first give an informal description of the underlying core principle, vicinal risk minimization (VRM).

3.2.1. Vicinal Risk Minimization

Machine learning in general, and neural network learning in particular, is concerned with the problem of estimating an unknown function based on a finite data sample. Suppose the training data comes from $S = X \times Y$ with the joint probability distribution $p$. Here $X$ represents the input space and $Y$ represents the label space. Let $h$ be the function represented by a given model and $L$ measure the loss $L(h(x), y)$ of $h$ when it makes a prediction for $(x, y) \in S$; then the expected loss of $h$ over $S$ is given by

$$R(h) = \iint_{S} p(x, y)\, L(h(x), y)\, dy\, dx. \tag{9}$$

The expected loss is also referred to as the true risk. However, since the distribution $p$ is unknown, $R$ is estimated with

$$\hat{R}(h) = \frac{1}{n} \sum_{(x_{i}, y_{i}) \in D} L(h(x_{i}), y_{i}), \tag{10}$$

where $D = \{(x_{i}, y_{i})\}_{i=1}^{n}$ is a finite sample of training data. $\hat{R}(h)$ is accordingly called the empirical risk. The goal is now to choose $h$ such that $\hat{R}$ is minimized. However, since $\hat{R}$ is only an estimate, the performance of $h$ on unseen data highly depends on the quality of $\hat{R}$. And like any sample-based statistic, the quality of $\hat{R}$ increases with the sample size. However, generating new data can be very expensive or even impossible. Instead of using complex methods to generate artificial data, VRM suggests sampling new points from the neighborhood (vicinity) of individual training points $x_{i}$, assuming that the corresponding label $y_{i}$ changes only slightly, if at all [16,17]. Let $v(x_{i})$ be the neighborhood of $x_{i}$; the vicinal risk $V$ is then defined as follows:

$$V(h) = \frac{1}{n} \sum_{(x_{i}, y_{i}) \in D} L(E[h(x_{i})], y_{i}), \quad \text{where} \quad E[h(x_{i})] = \int_{v(x_{i})} h(x)\, p(x \mid v(x_{i}))\, dx. \tag{11}$$

(11)
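To make the difference between Equations (10) and (11) concrete, the following sketch estimates both risks for a scalar toy model. All names are hypothetical, and the Gaussian vicinity is an assumption chosen purely for illustration; the integral in Equation (11) is approximated by a Monte Carlo average.

```python
import random


def empirical_risk(h, loss, data):
    """Equation (10): average loss of h over the training pairs (x, y)."""
    return sum(loss(h(x), y) for x, y in data) / len(data)


def vicinal_risk(h, loss, data, sample_vicinity, n_draws, rng):
    """Monte Carlo estimate of Equation (11): average h over draws from the
    vicinity of each x_i, then apply the loss to that averaged prediction."""
    total = 0.0
    for x, y in data:
        draws = [h(sample_vicinity(x, rng)) for _ in range(n_draws)]
        total += loss(sum(draws) / n_draws, y)
    return total / len(data)


def gaussian_vicinity(x, rng, scale=0.1):
    """Assumed vicinity for illustration: a Gaussian centered at x."""
    return x + rng.gauss(0.0, scale)


def squared_loss(prediction, y):
    return (prediction - y) ** 2


def toy_model(x):
    return 2.0 * x


toy_data = [(1.0, 2.0), (2.0, 3.0)]
r_hat = empirical_risk(toy_model, squared_loss, toy_data)
v_hat = vicinal_risk(toy_model, squared_loss, toy_data,
                     gaussian_vicinity, n_draws=200, rng=random.Random(0))
```

If the vicinity shrinks to the training point itself, the vicinal risk reduces to the empirical risk, which is a convenient sanity check for any implementation.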

3.2.2. Mixup

In mixup the vicinity of point $x_{i}$ is based on a convex combination with another point $x_{j}$, while the parameter $\lambda$ for this combination is sampled from a symmetric beta distribution:

$$v(x_{i}) = \{\lambda x_{i} + (1 - \lambda) x_{j} \mid x_{j} \in D,\ \lambda \sim \mathrm{Beta}(\alpha, \alpha)\}. \tag{12}$$

Notice how all points in $D$ can belong to the vicinity $v(x_{i})$ and that the parameter $\alpha$ of the beta distribution determines the probability of the location of the interpolated points. The beta distribution models possible values in the interval $\lambda \in [0, 1]$; while the mean $\mu_{\lambda}$ is always $\frac{1}{2}$, the variance is

$$\mathrm{Var}(\lambda) = \frac{1}{4 (2\alpha + 1)}. \tag{13}$$

With an increasing $\alpha$, the variance $\mathrm{Var}(\lambda)$ decreases and the $\lambda$s will cluster increasingly close to $\mu_{\lambda}$, placing interpolated points near the middle of the line. On the other hand, for $\alpha \to 0$ the beta distribution collapses to a discrete Bernoulli distribution with $p = 0.5$, which means that mixup no longer creates new points. However, we will see that this is not the case for Modmix. Since mixup creates points between every possible pair in $D$, regardless of the class, the label of an interpolated point $\hat{x}$ is also a convex combination, calculated with the same $\lambda$ used for $\hat{x}$:

$$\hat{y} = \lambda y_{i} + (1 - \lambda) y_{j}. \tag{14}$$
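Equations (12) to (14) translate directly into code. The sketch below uses only the Python standard library; the function names and the toy inputs are hypothetical and serve only as an illustration.

```python
import random


def mixup_pair(x_i, y_i, x_j, y_j, alpha, rng):
    """Create one mixup sample from two labeled points.

    lambda is drawn from Beta(alpha, alpha) (alpha must be positive) and is
    used for both the features (Equation (12)) and the label (Equation (14)).
    """
    lam = rng.betavariate(alpha, alpha)
    x_hat = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_hat = lam * y_i + (1.0 - lam) * y_j
    return x_hat, y_hat, lam


def beta_variance(alpha):
    """Variance of lambda ~ Beta(alpha, alpha), Equation (13)."""
    return 1.0 / (4.0 * (2.0 * alpha + 1.0))


rng = random.Random(0)  # fixed seed only for reproducibility of the sketch
x_hat, y_hat, lam = mixup_pair([0.0, 0.0], 0.0, [1.0, 2.0], 1.0, alpha=0.4, rng=rng)
```

For alpha = 1 the beta distribution is uniform on [0, 1], and Equation (13) gives the familiar variance 1/12.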

It is interesting to see what kind of figure mixup creates. Let $\mathrm{Conv}(D)$ be the convex polytope induced by the dataset $D$; then the set $M \subset \mathrm{Conv}(D)$ of all points that mixup can generate is

$$M = \bigcup_{x \in D} v(x). \tag{15}$$

In $M$ all line segments are connected, so $M$ can be imagined as a kind of mesh spanning $\mathrm{Conv}(D)$.

3.2.3. Modmix

Let $x^{T} = (x_{1}, x_{2}, \dots, x_{m}) \in \mathbb{R}^{d}$ be a multimodal feature vector, where $x_{i} \in \mathbb{R}^{d_{i}}$ represents the $i$-th multidimensional modality and $d = \sum d_{i}$. In Modmix a new point $\hat{x}$ is created by randomly selecting two data points $x_{i}, x_{j} \in D$ and then randomly interchanging modalities between those points. To see why this is a convex combination, let $I_{k}$ be a $(k \times k)$ identity matrix, $A$ be a $(d \times d)$ block-diagonal matrix and $\lambda_{1}, \dots, \lambda_{m} \in \{0, 1\}$ be drawn independently from a Bernoulli distribution with $p = 0.5$; then

$$\hat{x} = A x_{i} + (I_{d} - A) x_{j}, \quad \text{where} \quad A = \begin{pmatrix} \lambda_{1} I_{d_{1}} & 0 & \dots & 0 \\ 0 & \lambda_{2} I_{d_{2}} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \lambda_{m} I_{d_{m}} \end{pmatrix}. \tag{16}$$

Each $\lambda_{k}$ chooses whether the entire modality $x_{k}$ is taken from $x_{i}$ or from $x_{j}$. To get a notion of the geometry of Modmix let $X_{i}$ be the set of all instances of the $i$-th modality in $D$; then the Cartesian product $G = X_{1} \times X_{2} \times \dots \times X_{m}$ is the set of all possible modality combinations. The Modmix points $W \subset G$ are a special subset of $G$ where each point contains only modalities from at most 2 different points. To see how many points Modmix can generate, let $n$ be the number of available points and $i$ be the number of modalities being exchanged in generating $\hat{x}$; then the number of possible new points is

$$2 \sum_{i=1}^{m-1} \binom{n}{2} \binom{m}{i} = \binom{n}{2} \left(2^{m+1} - 2^{2}\right). \tag{17}$$

Consequently, the number of Modmix points grows as $\Theta(n^{2} 2^{m})$, i.e., quadratically with the number $n$ of original points and exponentially with the number $m$ of modalities. In our study we are only concerned with binary classification with label $y \in \{0, 1\}$, and if the two points $x_{i}, x_{j}$ are from different classes, then we label the generated point as follows:

$$\hat{y} = \frac{t}{m}, \tag{18}$$

where $t$ is the number of modalities belonging to class 1.
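The modality exchange of Equation (16) and the soft label of Equation (18) can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name, the flat list layout with `modality_sizes` giving the block lengths, and the toy inputs are all assumptions.

```python
import random


def modmix_pair(x_i, y_i, x_j, y_j, modality_sizes, rng):
    """Create one Modmix sample from two labeled points with labels in {0, 1}.

    x_i, x_j       : flat feature lists made of concatenated modalities
    modality_sizes : [d_1, ..., d_m], the number of features per modality
    Returns the new point, its soft label t / m, and the drawn lambdas.
    """
    if sum(modality_sizes) != len(x_i) or len(x_i) != len(x_j):
        raise ValueError("modality sizes must match the feature vector length")
    x_hat, lambdas = [], []
    start = 0
    for d_k in modality_sizes:
        lam_k = rng.randint(0, 1)  # Bernoulli draw with p = 0.5
        source = x_i if lam_k == 1 else x_j
        x_hat.extend(source[start:start + d_k])  # copy the whole modality block
        lambdas.append(lam_k)
        start += d_k
    # t counts the modalities that come from a class-1 point.
    t = sum(y_i if lam_k == 1 else y_j for lam_k in lambdas)
    return x_hat, t / len(modality_sizes), lambdas


rng = random.Random(1)  # fixed seed only for reproducibility of the sketch
x_hat, y_hat, lambdas = modmix_pair(
    [1, 1, 1, 1, 1], 1, [0, 0, 0, 0, 0], 0, modality_sizes=[2, 3], rng=rng
)
```

Note that when all drawn lambdas are equal the sketch simply reproduces one of the two original points; a practical implementation would presumably redraw in that case, since such samples add nothing new.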

This type of labeling can be interpreted as majority voting without decision at the modality level. Since individual modalities are represented by separate feature vectors, each modality can be assigned its own label. When multiple modalities are combined, the resulting sample is labeled using the corresponding pain ratio. This leads to soft targets similar in form to label interpolation in mixup and mixup-based works [18,19,20].

This labeling scheme penalizes overly sharp decision boundaries by inducing target distributions in regions corresponding to mixed modalities, with the goal of promoting stability under unseen modality deviations. As with other regularization techniques, however, this comes with no guarantee of performance improvement and may in some cases push the model away from a better solution, as has been analyzed for mixup [21,22,23].

4. X-ITE and BioVid Multimodal Datasets for Pain Detection

In this work, we use the X-ITE pain database, a multimodal dataset widely employed in pain research and automated pain recognition, particularly using facial expressions and physiological signals [24]. The dataset contains recordings from 134 subjects and includes video data as well as several physiological measurements, such as electrodermal activity (EDA), electrocardiogram (ECG), and surface electromyography (EMG).

Pain stimuli in X-ITE were applied at different intensities, targeting both tonic and phasic receptors using thermal and electrical stimulation protocols. In this study, we focus exclusively on the physiological modalities ECG, EDA, and EMG, which provide information about cardiac activity, skin conductance, and muscle activation, respectively. The EMG signals capture muscle activity from the corrugator supercilii (COR), musculus trapezius (TRA), and zygomaticus major (ZYG). These modalities result in high-dimensional feature representations, COR (82 features), ZYG (82 features), TRA (82 features), EDA (79 features), and ECG (87 features), which are concatenated to form a multimodal feature vector for each measurement. A detailed description of the feature extraction process can be found in [24].

In addition to X-ITE, we also employ the BioVid Heat Pain Database (Part A), a benchmark dataset for physiological pain recognition [25]. BioVid contains recordings from 87 healthy subjects. Pain was induced using subject-specific heat stimulation levels, ranging from baseline (no pain) to very severe pain. Consistent with our experimental design, we use only physiological signals from BioVid and exclude behavioral modalities such as facial video. The dataset includes multiple modalities such as EDA (13 features), ECG (13 features), and EMG (13 features). For our experiments the 13 features were extracted by using simple signal statistics of the raw signal.

Since this work focuses exclusively on binary pain classification, we categorize data from both datasets into two classes: no pain and pain. For X-ITE, this corresponds to the lowest and highest available pain levels, while for BioVid we use the baseline condition (BLN) and the highest pain level (PA4). All intermediate pain levels are excluded, aiming for a clear separation between classes and focusing the analysis on the most distinct physiological responses to pain.

5. Experimental Setup

5.1. Sampling Training Data

After filtering the X-ITE dataset for the lowest and highest pain levels, our database comprises $n = 134$ subjects with approximately 60 feature vectors per subject. However, to reduce the computation time and create a data scarcity situation, we first randomly select only 30 test subjects and then perform 10 trials for each of them in a scheme similar to cross-validation. For every trial of a given subject, we randomly draw 2500 samples without replacement from the data of the remaining $n - 1 = 133$ subjects as training data. In this way, we simulate a scenario in which the data may be insufficient, resulting in high model variance. One of our goals is to investigate how Modmix data augmentation can improve model performance in a way suggested by the principle of vicinal risk minimization.

A similar approach is used for BioVid, where the dataset comprises 80 subjects after excluding 7 subjects that do not contain the full set of 40 feature vectors. We evaluate the model on all 80 test subjects and randomly sample 3000 training samples without replacement from the remaining subjects.

5.2. Data Augmentation

We use Modmix as described in Section 3.2.3 to generate new data points for each subject, based on the training examples. Since Modmix can only create a finite number of new samples, the question arises whether this number is large enough. With a training sample size of only 2500 and five modalities, the number of new samples is given by

$$\binom{2500}{2} \left(2^{6} - 4\right) \approx 1.9 \cdot 10^{8}. \tag{19}$$

This shows that Modmix can create a large number of new samples, on the order of hundreds of millions. In practice, only a small fraction of this is needed to achieve sufficient diversity for training. In this study, we will only exchange one modality at a time, leaving us still with

$$2 \binom{2500}{2} \binom{5}{1} \approx 3 \cdot 10^{7} \tag{20}$$

possible new samples. In the case of X-ITE even altering only one modality already changes ∼20% of the feature vector. For this reason we filter the generated data with a dedicated neural network trained on the original data and keep only the samples that meet a sufficient confidence threshold.
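The counts in Equations (17), (19), and (20) are easy to verify with exact integer arithmetic. The function names below are hypothetical; the formulas are those given in the text.

```python
from math import comb


def modmix_count(n, m):
    """Left-hand side of Equation (17): 2 * sum_{i=1}^{m-1} C(n, 2) * C(m, i)."""
    return 2 * sum(comb(n, 2) * comb(m, i) for i in range(1, m))


def modmix_count_closed_form(n, m):
    """Right-hand side of Equation (17): C(n, 2) * (2^(m+1) - 2^2)."""
    return comb(n, 2) * (2 ** (m + 1) - 4)


def modmix_count_single_swap(n, m):
    """Equation (20): only one modality is exchanged at a time."""
    return 2 * comb(n, 2) * comb(m, 1)


all_swaps = modmix_count(2500, 5)                # 187425000
single_swaps = modmix_count_single_swap(2500, 5)  # 31237500
```

With n = 2500 and m = 5, the full count is 187,425,000 and the single-swap count is 31,237,500, in line with the orders of magnitude quoted above.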

5.3. Neural Network Model

We configured the RBF patch network used in this study with 160 patches of dimension 16. This patch configuration was determined by a coarse grid search. The entire network, including the RBF centers and width parameters, is trained jointly using stochastic gradient descent (SGD) with Nesterov momentum. To reduce the learning time and the number of learnable parameters, we use the upper triangular matrix of the Cholesky decomposition of the covariance matrix of the Mahalanobis distance. We then combine 10 independently trained RBF patch networks with different random patch assignments into a final decision by averaging the outputs of the individual models.
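The text does not spell out the exact Cholesky parametrization, so the following is only one plausible reading, stated as an assumption: the inverse covariance of a patch is written as $U^{T} U$ with an upper triangular matrix $U$, so that the Mahalanobis distance becomes $\lVert U (x - c) \rVert$ and only the entries on or above the diagonal need to be learned. The ensemble step is sketched as a plain average. All function names are hypothetical.

```python
import math


def mahalanobis_from_upper(x, c, U):
    """D_M(x, c) = ||U (x - c)|| when the inverse covariance equals U^T U.

    U is a k x k nested list; entries below the diagonal are ignored.
    """
    diff = [a - b for a, b in zip(x, c)]
    k = len(diff)
    z = [sum(U[r][s] * diff[s] for s in range(r, k)) for r in range(k)]
    return math.sqrt(sum(v * v for v in z))


def n_free_parameters(k):
    """Number of learnable entries of an upper triangular k x k matrix."""
    return k * (k + 1) // 2


def ensemble_average(models, x):
    """Average the outputs of independently trained models on input x."""
    return sum(model(x) for model in models) / len(models)


# For patches of dimension 16 this gives 136 parameters per neuron
# instead of the 256 entries of a full 16 x 16 matrix.
params_per_patch = n_free_parameters(16)
```

Under this assumed parametrization the resulting inverse covariance is symmetric and positive semidefinite by construction, which is one reason triangular factors are a common choice when covariance-like matrices are trained by gradient descent.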

6. Results

6.1. X-ITE

The main focus of this work is on experiments with X-ITE to investigate the proposed RBF patch network together with the Modmix data augmentation. We first performed all experiments with three different data variants: no augmentation, mixup, and Modmix. For a better understanding of the performance of the RBF patch network, all experiments were also conducted using the two related models, RBF-SVM and random forest. In particular, the random forest can be considered a benchmark, as it is one of the most successful, robust, and hard-to-beat ensemble models. In this work, a random forest of depth 2 was used. For each of the 30 test subjects, 10 trials were performed in a leave-one-out scheme, in which the training data in each test were randomly selected from the data excluding the test subject. In addition, both the Modmix and mixup training data contain 500 additional augmented data points each.

Figure 2 shows the average test accuracies of all models as box plots, sorted by training data variant. All models show comparable performance with median values of 85–87%. All combinations of model and training data show a similar distribution of accuracies, with relatively large interquartile ranges (IQRs) being noticeable. Accordingly, Table 1 shows consistently high relative standard deviations of 10–12%. This is probably due to the fact that test subjects respond differently to different stimuli and are therefore easier or more difficult to classify. This diversity is typical for medical applications.

At 87.1%, the RBF patch network in combination with Modmix shows the best performance of all combinations. However, an increase in the standard deviation of about 1% compared to no augmentation can also be observed. This suggests that Modmix may not lead to improvement in all subjects and may even lead to deterioration in some, which is confirmed by the accuracy-difference plots in Figure 3 for Modmix and mixup. The effect appears to be nearly symmetrical, with the number of improvements roughly matching the number of deteriorations, while some subjects were essentially unaffected.

Table 2 shows a more detailed comparison of the performance of the RBF patch network with the other models using different training data variants. In almost all comparisons, the RBF patch network delivers a statistically significant improvement in accuracy of at least 1% over the other models. The only exception is the comparison with the random forest without augmentation, where no improvement is observed; this difference, however, is not statistically significant.

Table 3, on the other hand, shows for each model how performance changes with Modmix relative to the other training data variants. For the RBF patch network, Modmix consistently delivers statistically significantly better results compared to no augmentation (+0.7%) and mixup (+0.8%). This result once again underscores that the RBF patch network improves with Modmix.

6.2. BioVid

To complement the results obtained with X-ITE and to better understand the behavior of the RBF patch network, we performed additional experiments with BioVid without data augmentation. Initial experiments suggested that data augmentation did not lead to consistent performance improvements on BioVid. We therefore focused our analysis on the non-augmented setting to gain unbiased insights into the models’ behavior.

Five runs were performed for each of the 80 test subjects. Figure 4 shows the mean accuracy of both models per subject, sorted in ascending order of random forest performance. The models follow the same trend; i.e., subjects who are classified well or poorly by the random forest are also classified well or poorly by the RBF patch network. Interestingly, however, the RBF patch network substantially improves the results for some subjects (by approx. 10%), while it worsens the results for others to approximately the same extent. On average, these fluctuations cancel each other out, and both models reach the same mean accuracy of about 81% across all trials. The fluctuations are probably due to the pronounced differences between the test subjects, which were also observed in the experiments with X-ITE. The large improvements for a subgroup of test subjects are very promising, and further research is warranted to determine what led to them and how the poorer results for other subjects can be improved.

7. Conclusions

In this article, we introduced the novel RBF patch network and combined it with Modmix, an effective yet simple data augmentation strategy. Both the model architecture and the data strategy rely on randomization with the goal of improving performance while reducing model variance. On X-ITE, this approach proved effective, with the proposed method exceeding the accuracy of the strong random forest baseline by 0.7 percentage points (87.1% vs. 86.4%). At the same time, the results clearly showed that Modmix does not have a uniformly positive effect across all subjects. Given the vast number of possible combinations enabled by Modmix, this behavior leaves room for further investigation of this kind of augmentation strategy.

Additional experiments on the BioVid dataset provided further insights into the behavior of the RBF patch network. In this setting, data augmentation did not yield consistent improvements and was therefore omitted to obtain an unbiased analysis of the models’ behavior. While the RBF patch network and random forest achieved the same mean accuracy across all subjects, their per-subject accuracies differed considerably. For a subset of subjects, the RBF patch network led to substantial performance gains, while for others, performance decreased to a similar extent. These effects cancel out on average, but they highlight the variability among subjects that was already observed on X-ITE.

From a modeling perspective, the results confirm that random feature patches can be effectively combined with RBF neurons to capture local similarities across modalities. The subject-dependent performance differences suggest that the proposed architecture interacts strongly with individual signal characteristics, which motivates further study. Future work could focus on identifying the factors that lead to performance gains for some subjects and on adapting the model so that all subjects benefit. Furthermore, the way the patches are sampled and combined offers several unexplored design choices. The potential of stacking multiple patch layers remains an open and promising direction for future research.

Author Contributions

Conceptualization, M.E., F.S., S.W. and S.G.; methodology, M.E.; software, M.E.; validation, M.E. and F.S.; formal analysis, F.S.; investigation, M.E., S.W. and S.G.; writing—original draft preparation, M.E.; writing—review and editing, M.E., F.S., S.W. and S.G.; supervision, F.S. and S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset generated and analyzed in this work is available through the X-ITE Pain Database at Otto von Guericke University Magdeburg: https://www.nit.ovgu.de/nit/en/XITE+Pain-p-1714.html (accessed on 10 February 2026). The BioVid dataset used in this study is publicly available. Researchers can request access via the official BioVid website of Otto von Guericke University Magdeburg: https://www.nit.ovgu.de/nit/en/BioVid-p-1358.html (accessed on 10 February 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Example architecture of the RBF patch network. The graph illustrates the core architecture, in which input from four modalities is processed. The features of each modality are randomly assigned to four RBF neurons. However, this assignment is fixed and part of the model.
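
One possible reading of this caption can be sketched in Python as follows. The sketch is illustrative only and is not the authors' implementation: the class name `PatchRBFLayer`, the parameters `per_modality` and `width`, the feature counts, and the randomly initialized centers are assumptions, and a plain Euclidean Gaussian kernel stands in for the Mahalanobis distance of Section 3.1.1 and for parameters that would be trained.

```python
import math
import random


class PatchRBFLayer:
    """Toy RBF layer in which every neuron sees a fixed random feature patch."""

    def __init__(self, modality_sizes, n_neurons=4, per_modality=2, seed=0):
        rng = random.Random(seed)
        self.width = 1.0   # assumed shared kernel width (would be learned)
        self.patches = []  # per neuron: list of (modality index, feature index)
        self.centers = []  # per neuron: one center coordinate per patch feature
        for _ in range(n_neurons):
            patch = []
            for m, size in enumerate(modality_sizes):
                # Independent draws per neuron, so patches may overlap.
                picked = rng.sample(range(size), min(per_modality, size))
                patch.extend((m, i) for i in sorted(picked))
            # The patch is sampled once here and never changes afterwards.
            self.patches.append(patch)
            # Random centers stand in for parameters that would be trained.
            self.centers.append([rng.uniform(-1.0, 1.0) for _ in patch])

    def forward(self, modalities):
        """Return one Gaussian activation per neuron for a list of modality vectors."""
        activations = []
        for patch, center in zip(self.patches, self.centers):
            sq_dist = sum(
                (modalities[m][i] - c) ** 2 for (m, i), c in zip(patch, center)
            )
            activations.append(math.exp(-sq_dist / (2.0 * self.width ** 2)))
        return activations


# Four modalities with hypothetical feature counts.
sizes = [5, 6, 4, 8]
layer = PatchRBFLayer(modality_sizes=sizes)
sample = [[0.1 * i for i in range(size)] for size in sizes]
activations = layer.forward(sample)
```

Because the random generator is seeded in the constructor, the feature-to-neuron assignment is reproducible and stays constant across forward passes, which mirrors the statement that the assignment is part of the model.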

Figure 2. Box plots of the mean test accuracies for each model tested, categorized according to the training data variant used. In total, 30 subjects were tested in a leave-one-out scheme with different training data samples. Ten trials were performed for each subject; the mean test accuracy is the mean value of these trials. The boxes represent the interquartile range (IQR), beginning at the first quartile and ending at the third quartile. The line within the box is the median, and the whiskers extend the boxes by 1.5 times the IQR. Outliers are marked as circles.
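
The quantities drawn in such a box plot can be computed as in the following minimal sketch. It assumes linearly interpolated quartiles and whiskers that end at the most extreme observation within 1.5 times the interquartile range of the box, which is a common plotting convention; the function name `box_plot_stats` and the accuracy values are made up for illustration and are not taken from the experiments.

```python
from statistics import quantiles


def box_plot_stats(values, whisker_factor=1.5):
    """Quartiles, interquartile range, whisker ends, and outliers for one box."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker_factor * iqr
    upper_fence = q3 + whisker_factor * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Whiskers end at the most extreme observations inside the fences.
        "whiskers": (inside[0], inside[-1]),
        # Everything beyond the fences is drawn as an individual circle.
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


# Made-up per-subject mean accuracies (%), not values from the paper.
accuracies = [62.0, 78.5, 81.0, 84.0, 85.5, 86.0, 88.0, 90.5, 92.0, 95.0]
stats = box_plot_stats(accuracies)
```

In this toy example, the subject with 62.0% accuracy falls below the lower fence and would therefore appear as an outlier circle.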

Figure 3. Per-subject differences in mean test accuracy (Δ accuracy) relative to the non-augmented (No Aug) baseline. Each point represents one subject, sorted by Δ accuracy. The left panel compares Modmix to No Aug (mean change = +0.67%), with 12 subjects showing improvement (↑), 7 showing no notable change (≈), and 11 showing a decrease (↓). The right panel compares mixup to No Aug (mean change = −0.11%), with 9 subjects improving, 6 unchanged, and 15 decreasing. Dashed horizontal lines mark the baseline (Δ = 0).
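
The per-subject summary shown in this figure can be reproduced along the following lines. This is a sketch under stated assumptions: the threshold for "no notable change" is not given here, so the `tolerance` of 0.5 percentage points is hypothetical, as are the function name `summarize_deltas` and the example accuracies.

```python
def summarize_deltas(baseline_acc, augmented_acc, tolerance=0.5):
    """Per-subject accuracy change relative to the baseline, in percentage points."""
    # One delta per subject, sorted in ascending order as in the plot.
    deltas = sorted(aug - base for base, aug in zip(baseline_acc, augmented_acc))
    improved = sum(d > tolerance for d in deltas)
    decreased = sum(d < -tolerance for d in deltas)
    return {
        "deltas": deltas,
        "mean_change": sum(deltas) / len(deltas),
        "improved": improved,
        # Subjects whose change stays within the tolerance band around zero.
        "unchanged": len(deltas) - improved - decreased,
        "decreased": decreased,
    }


# Made-up mean accuracies (%) for four subjects, not values from the paper.
baseline = [80.0, 85.0, 90.0, 70.0]
augmented = [82.0, 85.25, 88.0, 70.0]
summary = summarize_deltas(baseline, augmented)
```

With these toy values, one subject improves, one deteriorates, and two stay within the tolerance band, so the mean change is close to zero even though individual subjects shift by two percentage points.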

Figure 4. Comparison of classification performance between random forest and RBF patch network models across individual subjects. Mean accuracy over 5 trials is plotted for 80 subjects, sorted by random forest performance.

Table 1. Model accuracies in % (mean ± std) for each training data variant.

Training data	RBF-Patch-Net	RBF-SVM	Random Forest
No Aug	86.4 ± 10.3	83.7 ± 12.4	86.4 ± 10.7
Modmix	87.1 ± 11.3	85.0 ± 11.6	84.6 ± 12.3
Mixup	86.3 ± 10.7	84.8 ± 11.8	85.2 ± 11.7

Table 2. Performance comparison of the RBF patch network against other models under different data augmentation strategies. Δ accuracy is computed as (RBF patch network mean accuracy) − (mean accuracy of the compared model). The p-value was determined using a Wilcoxon signed-rank test, and the confidence interval (CI) was calculated based on a t-distribution.

Training data	Model	$Δ$ Acc. (%)	p-Value	95% CI
No Augmentation	RBF-SVM	+2.7	$p < 10^{-6}$	[1.8, 3.6]
No Augmentation	Random Forest	$-0.03$	0.479	[$-0.6$, 0.5]
Modmix	RBF-SVM	+2.3	$p < 10^{-6}$	[1.6, 3.0]
Modmix	Random Forest	+2.4	$p < 10^{-6}$	[1.6, 3.2]
Mixup	RBF-SVM	+1.3	0.044	[0.6, 2.0]
Mixup	Random Forest	+1.1	0.033	[0.3, 1.8]
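
The three statistics reported per row (Δ accuracy, p-value, and 95% CI) can be obtained from paired per-subject accuracies roughly as sketched below. The implementation used for the paper is not stated here, so several simplifications are assumptions: the signed-rank p-value uses the normal approximation without continuity or tie correction instead of an exact test, the t-quantile is hard-coded for 29 degrees of freedom (30 paired subjects), and the accuracies are synthetic.

```python
import math
import random
from statistics import NormalDist, mean, stdev

# 97.5% quantile of Student's t with 29 degrees of freedom (tabulated value);
# only appropriate for n = 30 paired differences.
T_CRIT_DF29 = 2.045


def signed_rank_p_value(diffs):
    """Two-sided Wilcoxon signed-rank p-value via the normal approximation."""
    nonzero = sorted((d for d in diffs if d != 0.0), key=abs)
    n = len(nonzero)
    if n == 0:
        return 1.0
    # Rank absolute differences, giving tied values their average rank.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(nonzero[j + 1]) == abs(nonzero[i]):
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1  # ranks are 1-based
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def t_confidence_interval(diffs, t_crit=T_CRIT_DF29):
    """Confidence interval for the mean paired difference based on a t-quantile."""
    half_width = t_crit * stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) - half_width, mean(diffs) + half_width


# Synthetic per-subject accuracies (%) for two models and 30 subjects.
rng = random.Random(0)
model_a = [rng.gauss(86.0, 10.0) for _ in range(30)]
model_b = [a - rng.gauss(2.0, 2.5) for a in model_a]
diffs = [a - b for a, b in zip(model_a, model_b)]

delta_acc = mean(diffs)
p_value = signed_rank_p_value(diffs)
ci_low, ci_high = t_confidence_interval(diffs)
```

Both statistics operate on the paired differences, so each subject serves as its own control and the large between-subject variance does not mask small but consistent differences between models.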

Table 3. Performance comparison of data augmentation strategies using Modmix as the baseline. Δ accuracy is computed as (Modmix mean accuracy) − (mean accuracy of the compared training data variant). The p-value was determined using a Wilcoxon signed-rank test, and the confidence interval (CI) was calculated based on a t-distribution.

Model	Training data	$Δ$ Acc. (%)	p-Value	95% CI
RBF-Patch-Net	No Augmentation	+0.7	0.004	[0.2, 1.1]
RBF-Patch-Net	Mixup	+0.8	0.007	[0.4, 1.2]
RBF-SVM	No Augmentation	+1.1	0.036	[0.2, 2.0]
RBF-SVM	Mixup	$-0.2$	$p < 10^{-5}$	[$-0.4$, $-0.1$]
Random Forest	No Augmentation	$-1.8$	$p < 10^{-4}$	[$-2.5$, $-1.1$]
Random Forest	Mixup	$-0.6$	0.042	[$-1.0$, $-0.1$]
