Article

MBS: A Modality-Balanced Strategy for Multimodal Sample Selection

College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing 211106, China
* Authors to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2026, 8(1), 17; https://doi.org/10.3390/make8010017
Submission received: 26 November 2025 / Revised: 5 January 2026 / Accepted: 6 January 2026 / Published: 8 January 2026
(This article belongs to the Section Learning)

Abstract

With the rapid development of applications such as edge computing, the Internet of Things (IoT), and embodied intelligence, massive multimodal data are continuously generated on end devices in a streaming manner. To maintain model adaptability and robustness in dynamic environments, incremental learning has gradually become the core training paradigm on edge devices. However, edge devices are constrained by limited computational, storage, and communication resources, making it infeasible to retain and process all data samples over time. This necessitates efficient data selection strategies to reduce redundancy and improve training efficiency. Existing sample selection methods primarily focus on overall sample difficulty or gradient contribution, but they overlook the heterogeneity of multimodal data in terms of information content and discriminative power. This often leads to modality imbalance, causing the model to over-rely on a single modality and suffer performance degradation. To address this issue, this paper proposes a multimodal sample selection strategy based on the Modality Balance Score (MBS). The method computes confidence scores at the modality level for each sample and further quantifies the contribution differences across modalities. In the selection process, samples with balanced modality contributions are prioritized, thereby improving training efficiency while alleviating modality bias. Experiments conducted on two benchmark datasets, CREMA-D and AVE, demonstrate that compared with existing approaches, the MBS strategy achieves the most stable performance under medium-to-high selection ratios (0.25–0.4), yielding superior results in both accuracy and robustness. These findings validate the effectiveness of the proposed strategy in resource-constrained scenarios, providing both theoretical insights and practical guidance for multimodal sample selection in learning tasks.

1. Introduction

With the rapid development of applications such as edge computing, the Internet of Things (IoT), and embodied intelligence, intelligent systems are gradually shifting from centralized cloud computing toward more distributed and real-time deployment paradigms. In addition to relying on the cloud for centralized model training and updates, an increasing number of tasks require incremental local training on edge devices to quickly adapt to dynamic environments; meanwhile, inference and decision-making must be executed directly on end devices to meet the demands of low latency, privacy preservation, and energy efficiency. This collaborative mode of “training on edge devices + inference on end devices” enables intelligent systems to achieve continual learning and rapid response under resource constraints, but it also imposes stricter requirements on data processing efficiency, model lightweighting, and sample management. In such scenarios, massive multimodal data from sensors, cameras, and interactive devices are continuously generated in a streaming manner, encompassing both high-dimensional information such as vision and speech, and low-dimensional signals such as environmental perception. These data provide rich semantic support for continual learning and adaptation of intelligent systems, but they also introduce new challenges: on the one hand, the computational, storage, and communication resources of edge devices and end nodes are limited, making it difficult to sustain long-term storage and training on full datasets; on the other hand, the data distribution evolves dynamically over time, requiring models to be continuously updated in order to maintain stable performance and generalization in dynamic environments. Therefore, in resource-constrained intelligent systems, how to efficiently process continuously arriving multimodal streaming data becomes a core challenge.
Traditional deep learning relies on one-time large-scale data training, where powerful computational resources and massive datasets enable the development of high-performance models that have achieved breakthrough progress in tasks such as image recognition, speech understanding, and natural language processing. However, this centralized, full-data-driven paradigm struggles to adapt to the continuous changes in data distribution under dynamic environments. Once trained, models remain relatively fixed and lack the ability for real-time updates. In contrast, Incremental Learning can continuously update models as new data arrive, thereby avoiding catastrophic forgetting and better aligning with the needs of distributed architectures. Particularly on edge devices, lightweight models can be incrementally trained to quickly adapt to local environmental changes, thus supporting inference and real-time decision-making on end devices. Nevertheless, edge devices are constrained by limited computational and storage resources, making it infeasible to retain all historical samples over the long term. This raises a critical task: how to correctly and efficiently select the most valuable data samples under limited resources.
Taking a UAV swarm system as an example, as illustrated in Figure 1, the UAVs continuously collect images, audio, and environmental sensor data, which are uploaded to edge nodes for incremental training. However, due to limitations in storage capacity, communication bandwidth, and edge computing power, it is infeasible to store all streaming data in full or to process all multimodal information in a timely manner. More importantly, different modalities exhibit significant disparities in data volume, semantic richness, and noise levels—for instance, the image modality typically contains more detailed semantic information, while audio or environmental sensing modalities may be sparser or more susceptible to interference. These differences can lead to modality imbalance during training, where the model gradually becomes biased toward the “dominant modality” with richer information, while neglecting the “weaker modalities.” This imbalance not only hinders effective cross-modal fusion, but also introduces prediction bias and degrades overall model performance. In resource-constrained edge environments, such issues are further exacerbated. This representative application scenario highlights a more general technical issue: how to design a sample selection strategy in resource-constrained incremental learning environments that balances efficiency and modality fairness. Such a strategy must not only reduce the overhead caused by redundant data under limited computational and storage conditions, but also correctly and efficiently identify and retain the samples most valuable for model convergence, thereby preventing the model from over-relying on a single modality and ensuring that training efficiency is maintained alongside cross-modal balance and generalization capability.
Existing sample selection methods have shown some effectiveness in mitigating data redundancy under resource-constrained conditions. Typically, these approaches are based on sample difficulty or gradient contribution, and can significantly improve training efficiency in single-modality or homogeneous data scenarios. However, their limitations become increasingly apparent when applied to multimodal incremental learning. The phenomenon of modality imbalance not only weakens cross-modal information fusion, but may also lead to long-term model bias and performance degradation. More critically, in resource-constrained edge environments, if the selection strategy fails to account for modality-level balance, such bias can be further amplified, making it difficult for the model to maintain stable and robust performance in dynamic tasks.
This highlights the urgent need for a new sample selection paradigm—one that can simultaneously balance efficiency and modality fairness under limited resources, and correctly and efficiently identify the samples most valuable for model convergence. To this end, we propose a modality-balanced multimodal sample selection strategy. This strategy operates in an online manner: either during early training or continuously, it computes the Modality Balance Score (MBS) for each data sample to quantify the relative contributions of dominant and weaker modalities to model convergence. Based on the distribution of MBS values, the strategy selects the subset of samples for subsequent training.
Specifically, we construct a non-parametric classifier based on a prototypical network, where the confidence score at the modality level is computed by measuring the distance between each modality-specific feature and its corresponding class prototype. This score characterizes the discriminative strength of each modality for the current sample and reflects the model’s reliance on that modality’s information. Take a dual-modality audio-visual learning task as an example: if a sample’s confidence score in the audio modality is significantly higher than that in the visual modality, it indicates that the sample is already highly discriminative in the audio modality, while its contribution from the visual modality is limited—and vice versa. However, such intuitive comparisons are insufficient for systematically assessing the balance of modality contributions across samples. To address this, we introduce MBS, which normalizes the confidence differences across modalities to produce a unified metric that reflects the degree of modality balance for each sample. Based on this score, we can perform real-time selection of samples with relatively balanced modality contributions, prioritizing them for subsequent incremental model updates. This sample selection strategy not only prevents the model from over-relying on a single dominant modality, but also improves learning efficiency under constrained computational and storage resources.
In summary, although existing research has made progress in sample selection under resource-constrained conditions, most approaches focus solely on the overall importance of samples, overlooking the disparities in information content and discriminative power across different modalities. This oversight can be detrimental in multimodal scenarios: it causes models to gradually favor a single dominant modality during long-term learning, thereby amplifying modality imbalance, weakening cross-modal fusion, and ultimately degrading performance. This challenge is particularly pronounced in incremental learning on edge devices, where limited computational and storage resources make it difficult to simultaneously ensure training efficiency and modality fairness. To address this gap, we propose a modality-centric perspective and introduce MBS. Based on MBS, we design a sample selection strategy that guides the selection process by quantifying the contribution disparities across modalities. This enables more efficient and robust multimodal incremental learning in dynamic, resource-constrained environments.
The main contributions of this work are summarized as follows:
(1)
We introduce the concept and computation of the Modality Balance Score (MBS), which effectively quantifies the contribution of multimodal samples to model convergence during training;
(2)
Based on MBS, we propose a sample selection strategy and demonstrate that it can identify effective samples at earlier stages of training, while exhibiting robustness under larger selection ratios;
(3)
We provide extensive experimental validation, comparing our approach with two other representative sample selection strategies on two multimodal datasets. These results offer useful references for researchers with different preferences in choosing sample selection strategies.

2. Related Work

2.1. Incremental Learning in Resource-Constrained Environments

Deploying deep learning models in resource-constrained edge environments faces multiple challenges, including limited computational capacity, storage space, and energy budgets. Traditional large-scale neural networks (e.g., ResNet, Transformer) struggle to meet real-time and energy-efficiency requirements on edge devices. Against this backdrop, extensive research has focused on lightweight network design and inference optimization, spanning multiple levels from model architecture to system implementation. Liu et al. [1] presented a systematic survey that comprehensively reviewed the progress of lightweight deep learning in resource-constrained environments, covering lightweight network architectures (e.g., MobileNet, TinyBERT), model compression techniques (pruning, quantization, distillation), and hardware-aware neural architecture search and compiler optimization mechanisms. Their work emphasized the importance of co-design across algorithms and systems to improve the efficiency of edge intelligence deployment. Meanwhile, Shuvo et al. [2] conducted a survey focusing on inference acceleration on edge devices, systematically summarizing mainstream optimization strategies across the model, compiler, and hardware layers, and highlighting multi-level collaborative optimization as the key pathway to efficient edge inference. To adapt to the computational limitations of edge devices, researchers have proposed various lightweight network architectures. SqueezeNet [3] significantly reduces parameter counts through “fire modules”; ShuffleNet [4] improves information flow efficiency via channel shuffle mechanisms; EfficientNet [5] balances accuracy and efficiency using a compound scaling strategy. More recently, lightweight Transformer models such as MobileViT [6] and Mobile-Former [7] have been introduced to support multimodal and sequence modeling tasks. These approaches provide feasible architectural foundations for real-time inference on edge devices.
Beyond architectural design, model compression techniques are equally critical for edge deployment. Han et al. [8] pioneered deep compression through unstructured pruning and weight quantization, though the resulting sparsity offered limited acceleration on general-purpose hardware. Molchanov et al. [9] proposed more refined pruning criteria based on Taylor expansion, while Lee et al. [10] introduced the SNIP method, which efficiently identifies redundant parameters at early training stages. In parallel, quantization methods (e.g., Jacob et al. [11] with 8-bit quantization) and knowledge distillation [12] have been widely adopted to further reduce model complexity. At the hardware level, specialized accelerators such as FPGA, ASIC, and TPU have been extensively employed to enhance inference efficiency [13]. Meanwhile, edge inference paradigms continue to evolve, including fully local inference, edge-server inference, and hierarchical collaborative inference [14]. These approaches provide different trade-offs among latency, energy consumption, and privacy preservation.
In resource-constrained edge environments, model deployment faces multiple challenges such as limited computational power and storage capacity, which have motivated researchers to explore lightweight architectures and efficient inference mechanisms. However, relying solely on architectural optimization remains insufficient to meet the demands of continual learning, especially in scenarios where multimodal streaming data continuously arrive. Incremental learning has gradually emerged as a key pathway to enhance model adaptability and robustness. To support stable model updates in dynamic environments, beyond architectural and system-level optimizations, recent studies have also begun to focus on how data-level selection and pruning can improve training efficiency and generalization in incremental learning scenarios.
Two recent authoritative surveys [15,16] systematically reviewed the development of incremental learning from theoretical, methodological, and application perspectives, emphasizing the central role of memory management and data selection in improving training efficiency, enhancing generalization, and mitigating catastrophic forgetting. Traditional approaches often borrow sample selection strategies from single-modality incremental learning, such as the herding method [17], which selects centroid samples in the feature space to preserve class distribution. However, single-modality representativeness is insufficient to ensure cross-modal consistency. To address this, researchers have proposed joint embedding-based selection methods [18], which leverage cross-modal aligned embedding spaces to improve the semantic consistency of replay samples. More recently, researchers have attempted to refine data selection strategies through uncertainty modeling and contrastive learning. For example, Ding et al. [19] proposed Uncertainty-Aware Contrastive Learning (UACL) for semi-supervised classification of multimodal remote sensing images. By distinguishing between high-confidence and low-confidence samples, they adopted “hard” and “soft” contrastive learning strategies, respectively, thereby fully exploiting multimodal data under limited annotation conditions. This idea inspires data selection in incremental learning: beyond representativeness, uncertainty information should also be incorporated to dynamically adjust sample importance. Meanwhile, related research has drawn insights from single-modality data pruning and federated learning scenarios. Paul et al. [20] introduced the “data diet” method, demonstrating that identifying and retaining key samples early in training can significantly reduce dataset size without sacrificing performance. Yang et al. [21] proposed a comprehensive data pruning framework for person re-identification tasks, showcasing the potential of data selection to improve generalization and efficiency. Similarly, Gong et al. [22] investigated online data selection under limited storage in federated learning, proposing dynamic selection strategies to balance storage and performance. These ideas provide important references for memory management in multimodal incremental learning.
In multimodal learning scenarios, Peng et al. [23] proposed the Balanced Multimodal Learning framework, which achieves dynamic balance among modalities through gradient modulation. Although its core contribution lies in fusion strategies, it indirectly inspires sample selection: in incremental learning, priority should be given to retaining samples that promote modality balance and complementarity. Similarly, Fan et al. [24] introduced the Prototypical Modal Rebalance (PMR) method, which dynamically adjusts modality learning speed by incorporating class prototypes, effectively alleviating modality imbalance and offering a new perspective for sample selection in memory management. Recent studies have further expanded the boundaries of data selection. For example, Mahmoud et al. [25] proposed the Sieve method, which leverages image captioning models to generate semantic proxy texts and prunes image–text pairs based on semantic consistency evaluation, significantly improving the quality of multimodal pre-training data. Ye et al. [26] introduced the Fit and Prune strategy during inference in multimodal large language models, employing a training-independent visual token pruning mechanism that substantially reduces computational overhead while maintaining performance. These approaches enrich the technical pathways of sample selection in multimodal learning from perspectives such as semantic alignment, modality redundancy compression, and inference efficiency.
Overall, achieving efficient incremental learning in resource-constrained edge environments faces multiple challenges, ranging from model architecture to data management. Existing research has proposed diverse technical pathways in lightweight architectures, inference optimization, memory management, and sample selection, covering dimensions such as representativeness, difficulty, uncertainty, complementarity, and modality balance. Although these methods have made encouraging progress in mitigating catastrophic forgetting, improving training efficiency, and enhancing generalization, in dynamic tasks where multimodal streaming data continuously arrive, significant disparities remain across modalities in terms of data scale, semantic richness, and noise levels. How to achieve adaptive selection and joint optimization tailored to modality differences remains an urgent and unresolved challenge.

2.2. Data Selection Based on Quantified Sample Attributes

In resource-constrained scenarios, data-level selection and pruning have become important means to improve the training efficiency and generalization capability of incremental learning. As a key component of this research direction, sample selection strategies have gradually evolved from heuristic approaches to score-based mechanisms. The core idea is to quantify critical sample attributes—such as representativeness, uncertainty, and modality contribution—in order to assist in evaluating their value and retention priority during model updates.
To quantify the importance of training samples, Paul et al. [20] proposed two representative metrics: the GraNd (Gradient Normed) score and the EL2N (Error L2-Norm) score, which have served as baselines or crucial references in recent work on continual learning and data pruning [21,22]. The former measures a sample’s contribution to model learning via the gradient norm of its parameter updates, while the latter evaluates sample difficulty by computing the L2 distance between the predicted distribution and the ground-truth label. Their mathematical definitions are as follows:
(1)
GraNd score:
$$\mathrm{GraNd}(x_i, y_i) = \big\| \nabla_{\theta}\, \ell(x_i, y_i; \theta) \big\|_2$$
where $\ell(x_i, y_i; \theta)$ is the cross-entropy loss. This score reflects the driving force a sample exerts on gradient updates under the current parameters: the larger the gradient norm, the greater the sample’s impact on model training.
(2)
EL2N score:
$$\mathrm{EL2N}(x_i, y_i) = \big\| p_{\theta}(x_i) - y_i \big\|_2$$
where $p_{\theta}(x_i)$ denotes the model’s predicted probability distribution for sample $x_i$ under parameters $\theta$, and $y_i$ is the one-hot label vector. The higher the score, the harder the sample is to learn and the more important it is.
Both metrics can be computed in the early stages of training and are widely used for data pruning and subset selection. They help reduce training overhead while preserving model performance, playing a crucial role in improving the efficiency and stability of incremental learning.
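As a concrete illustration, the sketch below (ours, not code from [20]) computes both scores for a linear softmax classifier in NumPy; for this model the per-sample gradient of the cross-entropy loss has the closed form $\nabla_W \ell = x\,(p - y)^\top$ and $\nabla_b \ell = p - y$, so the GraNd norm factorizes as $\|p - y\|_2 \sqrt{\|x\|^2 + 1}$.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def el2n_scores(X, y, W, b):
    """EL2N(x_i, y_i) = || p_theta(x_i) - onehot(y_i) ||_2."""
    p = softmax(X @ W + b)                 # (n, C) predicted distributions
    onehot = np.eye(W.shape[1])[y]
    return np.linalg.norm(p - onehot, axis=1)

def grand_scores(X, y, W, b):
    """GraNd(x_i, y_i) = || grad_theta CE ||_2 for a linear softmax model.

    grad_W = outer(x, p - onehot) and grad_b = p - onehot, so the joint
    parameter-gradient norm is ||p - onehot||_2 * sqrt(||x||^2 + 1).
    """
    resid = el2n_scores(X, y, W, b)        # ||p - onehot||_2 per sample
    return resid * np.sqrt((X ** 2).sum(axis=1) + 1.0)
```

Both scores are cheap to evaluate after a few pre-training epochs, which is how they are typically used for early data pruning.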
Furthermore, in multimodal streaming tasks, the disparity in information content and discriminative power across modalities makes it difficult for traditional selection criteria to balance modality fairness and selection efficiency. As a result, modality balance quantification has become one of the key guiding principles for sample selection. In this regard, Fan et al. [24] proposed the Prototypical Modal Rebalance (PMR) method, which systematically models the common issue of modality imbalance in multimodal learning. The core idea is to compute class prototypes within each unimodal representation space and construct a non-parametric classifier to independently assess the learning effectiveness of each modality. Based on the distribution of distances between samples and their corresponding prototypes, the method designs metrics to characterize the learning progress disparity across modalities. Building on this, the authors introduced the Prototypical Cross-Entropy (PCE) loss, which enhances clustering in weaker modalities by minimizing the distance between samples and their class prototypes. In parallel, they proposed Prototypical Entropy Regularization (PER) to suppress overly rapid convergence in dominant modalities, thereby achieving dynamic balance across modalities. This framework provides a general solution for quantifying and regulating modality balance, independent of specific model architectures or fusion strategies. It effectively mitigates performance bottlenecks caused by modality imbalance and improves the robustness and generalization of multimodal learning.
Inspired by the impact of modality imbalance on model performance in multimodal incremental learning, this work further proposes a sample selection strategy based on modality balance quantification, aiming to identify samples with more balanced modality contributions at early training stages.

3. Basic Model

Our work considers a dual-modality (audio-visual) incremental learning paradigm. We further analyze the sample selection requirements for multimodal streaming data under resource-constrained conditions, and formulate an optimization objective that provides a theoretical foundation for the design of subsequent selection strategies.

3.1. Dual-Modality Incremental Learning Paradigm

Consider a dual-modality (audio, video) incremental learning paradigm, in which the sample arriving at time step $t$ is defined as a pair $(x_t, y_t)$ with
$$x_t = (x_t^a, x_t^v)$$
where $x_t^a \in \mathbb{R}^a$ denotes the audio-modality features, such as a Mel-spectrogram; $x_t^v \in \mathbb{R}^v$ denotes the video-modality features, usually image frames; and $y_t \in \mathcal{Y}_t$ is the sample label. From this, two encoded feature representations can be defined:
$$z_t^a = \mathrm{En}^a(x_t^a), \quad z_t^v = \mathrm{En}^v(x_t^v)$$
As shown in Figure 2, $\mathrm{En}^a: \mathbb{R}^a \to \mathbb{R}^m$ consists of MFCC feature extraction followed by a recurrent neural network (RNN) to capture temporal dynamics in the audio stream, and $\mathrm{En}^v: \mathbb{R}^v \to \mathbb{R}^m$ consists of a MobileNet_V2 backbone followed by an RNN to encode visual features over time. Their role is to map the features of each modality into a shared multimodal feature space. These modality-specific embeddings are then fused to obtain a joint representation:
$$z_t^m = f_\theta(z_t^a, z_t^v)$$
Here $z_t^m \in \mathbb{R}^m$ denotes the fused multimodal feature. The fusion function $f_\theta$ can be realized in several ways, including feature summation, feature concatenation, and Feature-wise Linear Modulation (FiLM). After feature fusion, the classifier $g$ outputs the category probability $P_t$:
$$P_t = \mathrm{softmax}\big(g(z_t^m)\big)$$
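This forward path can be sketched in a few lines (our illustration; `fused_forward` and its `fuse` argument are hypothetical names), covering two of the fusion choices named in the text, summation and concatenation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()        # numerical stability
    e = np.exp(z)
    return e / e.sum()

def fused_forward(z_a, z_v, g, fuse="sum"):
    """Fuse modality embeddings and classify: P_t = softmax(g(z_t^m))."""
    if fuse == "sum":                      # f_theta = element-wise summation
        z_m = z_a + z_v
    elif fuse == "concat":                 # f_theta = feature concatenation
        z_m = np.concatenate([z_a, z_v])
    else:
        raise ValueError(f"unknown fusion: {fuse}")
    return softmax(g(z_m))
```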
A basic objective function can be defined for incremental learning on audio and video bimodal data samples. The core idea is cross-entropy + knowledge distillation:
$$L_t = \frac{1}{|B_t|} \sum_{(x, y) \in B_t} L_{CE}(P_x, y) \;+\; \lambda_{KD}\, \frac{1}{|M_t|} \sum_{(x, y) \in M_t} L_{KD}\big(P_x, P_{pre}(x)\big)$$
Here $B_t$ denotes the current sample batch, $M_t$ a limited sample replay buffer, $L_{CE}$ the cross-entropy loss, $L_{KD}$ the distillation loss that retains knowledge of previous tasks, $P_{pre}$ the model output of the previous stage, and $\lambda_{KD}$ a hyperparameter controlling the weight of the knowledge distillation loss.
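The loss can be sketched as follows (our sketch; we assume the distillation term $L_{KD}$ is a KL divergence between the previous and current output distributions, a common choice that the text does not pin down):

```python
import numpy as np

def incremental_loss(P_batch, y_batch, P_buf, P_buf_prev, lam_kd=1.0, eps=1e-12):
    """L_t = mean CE over batch B_t + lam_kd * mean KD over buffer M_t.

    All P_* arguments are (n, C) softmax probability arrays; KD is taken
    here as KL(P_prev || P_current) on the replay buffer (an assumption).
    """
    n = len(y_batch)
    # cross-entropy on the current batch B_t
    ce = -np.mean(np.log(P_batch[np.arange(n), y_batch] + eps))
    # distillation on the buffer M_t: KL divergence from previous outputs
    kd = np.mean(np.sum(P_buf_prev * (np.log(P_buf_prev + eps)
                                      - np.log(P_buf + eps)), axis=1))
    return ce + lam_kd * kd
```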
To support incremental learning, class prototypes for each modality are updated online. For a new sample ( x t , y t ) , the prototype of class y t is updated using an exponential moving average (EMA):
$$c_{y_t}^{m} \;\leftarrow\; \alpha\, c_{y_t}^{m} + (1 - \alpha)\, z_t^{m}, \quad m \in \{a, v\}$$
where $m$ denotes the modality and $\alpha$ controls the update momentum. This strategy avoids storing all past samples and ensures stable prototype estimation under streaming data.
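A compact sketch of this online prototype maintenance (ours; `PrototypeBank` is a hypothetical helper, and we assume a class's prototype is initialized with its first embedding):

```python
import numpy as np

class PrototypeBank:
    """Online per-modality class prototypes via exponential moving average.

    Prototypes are keyed by (modality, class) and updated with
    c <- alpha * c + (1 - alpha) * z, so no past samples are stored.
    """
    def __init__(self, alpha=0.9):
        self.alpha = alpha
        self.protos = {}   # (modality, class) -> prototype vector

    def update(self, modality, label, z):
        key = (modality, label)
        if key not in self.protos:            # first sample initializes
            self.protos[key] = z.copy()
        else:
            c = self.protos[key]
            self.protos[key] = self.alpha * c + (1 - self.alpha) * z
        return self.protos[key]
```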

3.2. Optimization Objective

Given the large data scale, significant modality-specific disparities, and resource constraints, retaining all raw data indiscriminately during incremental learning would rapidly exhaust computational and storage resources, degrade training efficiency, and increase inference latency. Therefore, our optimization objective focuses on identifying and retaining the most convergence-critical samples as early as possible during edge-side incremental training, under limited computational and storage conditions. This is achieved through an effective sample selection strategy that also ensures balanced modality contributions, thereby improving training efficiency, convergence quality, and robustness across varying selection ratios.
Figure 3 illustrates the basic logic of data selection in the context of multimodal incremental learning. To facilitate our study, we treat the notion of “early training” as an independent pre-training process. This design allows us to flexibly set different numbers of pre-training epochs to observe sample attributes at various stages of early training. Our goal is to identify and retain, through the selection strategy μ , the samples most valuable for model convergence within as few training iterations as possible in the current batch. At the same time, strategy μ should exhibit strong robustness—namely, even under more stringent resource-constrained conditions (where fewer samples are retained for training), it should still maintain model accuracy at a reasonable level.
Define the minimization goal J :
$$\min_{\mu,\, \rho,\, R} \; J(\mu, \rho, R) = L_{B_t}(\mu) + \lambda_1 \cdot C_{\mathrm{eff}}(R) + \lambda_2 \cdot C_{\mathrm{rob}}(\mu, \rho)$$
where $L_{B_t}(\mu)$ denotes the training loss on the subset of the current batch selected by the strategy $\mu(\rho, R)$; $C_{\mathrm{eff}}(R)$ denotes the efficiency cost incurred by the pre-training rounds, reflecting whether the target performance can be reached in as few training rounds as possible; $C_{\mathrm{rob}}(\mu, \rho)$ denotes the robustness cost, characterizing the stability and adaptability of the strategy under different selection ratios, especially whether reasonable performance can be maintained under extreme resource constraints (large $\rho$); and $\lambda_1, \lambda_2$ are penalty weights. In short, this objective seeks to identify, in the early stage of training and with as few training rounds and as little resource consumption as possible, the samples most valuable and stable for the model, thereby jointly accounting for convergence speed, efficiency, and robustness.

4. Main Method

Starting from the contribution disparities across modalities, we design a modality-balanced sample selection method that identifies and retains samples with more balanced modality contributions during the early stages of training. This approach enhances both the convergence speed and the long-term stability of the model.

4.1. Modality Balance Quantification: MBS

In Equation (3) presented in the previous section, we discussed the fused feature output of the two modalities. This process can be further expressed in detail as:
$$z_t^m = f_\theta(z_t^a, z_t^v) = W \cdot [z_t^a;\, z_t^v] + b$$
This is the output of a linear layer, where $W$ denotes the weight matrix applied to the concatenated feature representations of the two modalities, and $b$ is the bias term used to maintain output stability. Taking a simple summation strategy as an example (i.e., the fusion $f_\theta$ corresponds to summation), the independent outputs of each modality can be decomposed as follows:
$$f_\theta(z_t^a, z_t^v) = (W^a \cdot z_t^a + b^a) + (W^v \cdot z_t^v + b^v)$$
Accordingly, we can quantify the contribution of each modality using its softmax output, which is commonly referred to as the confidence of sample x t in a given modality, defined as:
$$P^a = \mathrm{softmax}\big(g(z_t^a)\big)_y = \mathrm{softmax}\big(W^a \cdot z_t^a + b^a\big)_y, \quad P^v = \mathrm{softmax}\big(g(z_t^v)\big)_y = \mathrm{softmax}\big(W^v \cdot z_t^v + b^v\big)_y$$
According to Equation (11), a sample $(x_t, y_t)$ with $x_t = (x_t^a, x_t^v)$ has confidence scores $P^a$ and $P^v$ in the audio and video modalities, respectively. Generally, the relationship between two variables can be characterized either by their difference, which emphasizes absolute variation, or by their ratio, which reflects relative change. In the context of this work, the confidence ratio between the audio and video modalities captures the relative contribution strength across modalities. We define this ratio as the Modality Balance Score (MBS). Under the summation strategy $f_\theta = \mathrm{sum}$, the MBS of a multimodal sample $x_t$ can be expressed as:
MBS(x_t) = P^a / P^v = [softmax(W^a · z_t^a + b^a)]_y / [softmax(W^v · z_t^v + b^v)]_y
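As a concrete illustration, the per-modality confidences and the resulting MBS can be sketched as follows (a NumPy sketch; the array shapes, helper names, and the small eps guarding against division by zero are our choices, not part of the original formulation):

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def modality_balance_score(z_a, z_v, y, W_a, b_a, W_v, b_v, eps=1e-12):
    """MBS(x_t) = P^a / P^v: ratio of the ground-truth-class confidences
    produced by the two unimodal classifier branches.

    z_a : (N, d_a) audio features,  z_v : (N, d_v) video features
    W_a : (C, d_a), W_v : (C, d_v)  per-modality slices of the fusion classifier
    y   : (N,) ground-truth class indices
    """
    p_a = softmax(z_a @ W_a.T + b_a)[np.arange(len(y)), y]  # P^a
    p_v = softmax(z_v @ W_v.T + b_v)[np.arange(len(y)), y]  # P^v
    return p_a / (p_v + eps)  # values near 1 indicate balanced modalities
```

Feeding identical features and weights to both branches yields MBS ≈ 1 for every sample, the fully balanced case.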

4.2. Multimodal Sample Selection Strategy: Based on MBS

Evidently, MBS(x_t) reflects the ratio of contributions from the two modalities of a multimodal sample x_t in the gradient descent computation. When MBS approaches 1, the two modalities contribute comparably; the further MBS deviates from 1, whether MBS > 1 or 0 < MBS < 1, the more strongly one modality dominates, leading to modality imbalance. Therefore, in principle, sample selection should retain as many samples as possible with MBS values close to 1, while discarding those dominated by a single modality. Specifically, samples with MBS(x_t) ∈ (0, 1] and MBS(x_t) ∈ (1, +∞) should be preserved in equal proportions.
As mentioned in Section 3.2, to facilitate our study, we treat the notion of “early training” as an independent pre-training process, with the number of pre-training epochs denoted by R . Accordingly, we define the batch of samples after R rounds of pre-training as the set:
B_R = {x_1, x_2, …, x_n}
Let the selection ratio be ρ ∈ (0, 1]. Then, the sample set obtained after applying the selection strategy μ can be expressed as:
B_R^μ = Top_ρ(B_R, μ(x_t))
In principle, Top_ρ(B_R, μ(x_t)) denotes selecting from B_R the ⌊ρn⌋ samples that minimize the function value. The selection strategy is defined as μ(x_t) = |MBS(x_t) − 1|, which we refer to as the modality balance deviation.
B_R^{μ<} = Top_{ρ/2}({x_t ∈ B_R : MBS(x_t) ≤ 1}, |MBS(x_t) − 1|),  B_R^{μ>} = Top_{ρ/2}({x_t ∈ B_R : MBS(x_t) > 1}, |MBS(x_t) − 1|),  B_R^μ = B_R^{μ<} ∪ B_R^{μ>}
In practice, since the learning task over the sample set inherently exhibits varying degrees of dependence on different modalities, the median MBS of the samples after R rounds of pre-training is typically not equal to 1. If the theoretical selection strategy were applied directly, it might result in insufficient samples being retained for formal training. Therefore, let the median MBS of the samples after R rounds of pre-training be denoted as m . Based on this, the practical selection strategy with respect to the MBS is defined as:
B_R^{μ<} = Top_{ρ/2}({x_t ∈ B_R : MBS(x_t) ≤ m}, |MBS(x_t) − m|),  B_R^{μ>} = Top_{ρ/2}({x_t ∈ B_R : MBS(x_t) > m}, |MBS(x_t) − m|),  B_R^μ = B_R^{μ<} ∪ B_R^{μ>}
At this point, the sample set B R μ , obtained through the selection strategy μ , can be utilized for subsequent incremental training. This process can be expressed as Algorithm 1.
Algorithm 1: Multimodal Sample Selection based on MBS
Input:
Batch of samples B_R = {x_1, x_2, …, x_n}
Pretrained encoders h^a, h^v
Per-modality classifier parameters {W^a, W^v, b^a, b^v} defining the strategy μ
Selection ratio ρ ∈ (0, 1]
Output:
Filtered sample set B_R^μ
Steps:
1:  Initialize MBS_list = []
2:  For each sample x_i ∈ B_R do
3:    z_i^a = h^a(x_i^a)
4:    z_i^v = h^v(x_i^v)
5:    p_i^a = [softmax(W^a · z_i^a + b^a)]_{y_i}
6:    p_i^v = [softmax(W^v · z_i^v + b^v)]_{y_i}
7:    MBS_i = p_i^a / p_i^v
8:    Append (x_i, MBS_i) to MBS_list
9:  End For
10: Compute m = median{MBS_i | (x_i, MBS_i) ∈ MBS_list}
11: Initialize B_R^{μ<} = [], B_R^{μ>} = []
12: For each (x_i, MBS_i) ∈ MBS_list do
13:    If MBS_i ≤ m then
14:      Append (x_i, |MBS_i − m|) to B_R^{μ<}
15:    Else
16:      Append (x_i, |MBS_i − m|) to B_R^{μ>}
17:    End If
18: End For
19: Sort B_R^{μ<} and B_R^{μ>} by deviation |MBS_i − m| in ascending order
20: K = floor(ρ · n / 2)
21: Initialize B_R^μ = {}
22: Add top K samples from B_R^{μ<} to B_R^μ
23: Add top K samples from B_R^{μ>} to B_R^μ
24: Return B_R^μ
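For reference, the median-split selection in steps 10–24 of Algorithm 1 can be sketched as follows (a NumPy sketch assuming the MBS of every sample has already been computed during pre-training; the function name and tie handling are our choices):

```python
import numpy as np

def select_by_mbs(mbs, rho):
    """Median-split selection over precomputed MBS values.

    mbs : array of Modality Balance Scores after R pre-training rounds
    rho : selection ratio in (0, 1]
    Returns the indices of the retained, most modality-balanced samples.
    """
    mbs = np.asarray(mbs, dtype=float)
    m = np.median(mbs)                        # practical balance point (not 1)
    k = int(np.floor(rho * len(mbs) / 2))     # K = floor(rho * n / 2) per side
    dev = np.abs(mbs - m)                     # modality balance deviation
    below = np.flatnonzero(mbs <= m)          # side B_R^{mu<}
    above = np.flatnonzero(mbs > m)           # side B_R^{mu>}
    keep_lo = below[np.argsort(dev[below])][:k]  # K samples closest to m
    keep_hi = above[np.argsort(dev[above])][:k]
    return np.concatenate([keep_lo, keep_hi])
```

Splitting at the median before ranking by deviation guarantees that both sides contribute equally, which is the point of the practical strategy.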

4.3. Effectiveness

To preliminarily verify the effectiveness of the sample selection strategy based on modality balance deviation, we conducted 10 rounds of pre-training on the CREMA-D dataset. After pre-training, we computed the EL2N scores, GraNd scores, and MBS for all samples in the training set, with the selection ratio set as ρ ∈ {0.1, 0.2, 0.3}.
Under EL2N score ranking, low-score samples that are easier to learn and yield stable predictions were retained; under GraNd score ranking, high-score samples that exert greater influence on parameter updates were retained; under MBS ranking, samples at both ends of the score distribution were discarded, with a proportion of ρ/2 removed from each side.
Subsequently, 20 rounds of formal training were performed on the sample sets obtained through the above strategies, and the final model accuracies were recorded as shown in Table 1. The selection strategy based on MBS achieved performance comparable to the other two strategies (details provided in the Appendix A), with the final model accuracy being slightly superior.
This preliminary experiment validates the feasibility and effectiveness of MBS in multimodal sample selection, providing prior support for subsequent extensive experimental evaluations.

5. Evaluation

5.1. Experimental Setup

Our experiments were conducted on two audio-visual multimodal datasets: CREMA-D and AVE. CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset) [27] is a widely used multimodal dataset for emotion recognition research. It contains 7442 audio-visual clips recorded by 91 actors of diverse genders, ages, and ethnic backgrounds. Each clip features performances of six basic emotions (anger, disgust, fear, happiness, neutral, sadness) as well as an additional emotion, “surprise,” expressed under varying intensities and contexts. The dataset provides both audio and video modalities, supporting cross-modal emotion recognition and fusion studies. AVE (Audio-Visual Event Dataset) [28] is designed for audio-visual event recognition. It consists of more than 4000 ten-second audio-visual clips collected from YouTube, covering 28 categories of everyday events (e.g., “dog barking,” “drum beating,” “street noise,” “crowd cheering”). A distinctive feature of this dataset is that events often exhibit both audio and visual manifestations, but the degree of modality dependence varies significantly. For instance, some events rely primarily on audio signals (e.g., “thunder”), while others depend more on visual information (e.g., “playing musical instruments”).
All experiments in this study were conducted on a high-performance workstation equipped with two NVIDIA RTX A4000 GPUs (16 GB memory each), an Intel Core i9-12900 processor, and 64 GB RAM. The software environment was based on Ubuntu 24.04, with Python 3.9 as the primary programming language. Deep learning experiments were implemented using PyTorch 2.2, combined with CUDA 12.1 and cuDNN 8.9 to fully leverage GPU capabilities.

5.2. Experiment A: Multi-Attribute Observation of the Sample Set

For the CREMA-D and AVE datasets, we first randomly split each dataset into training and test sets with an 8:2 ratio. We then conducted 30 rounds of pre-training and computed the EL2N, GraNd, and MBS values for the training samples involved in this process. A review of the properties described by these three measures is summarized in Table 2.
The EL2N score measures the learning difficulty of a sample: a lower value indicates that the sample is easier for the model to learn, whereas a higher value suggests that the sample is harder to fit, potentially containing noise or lying near the decision boundary. The GraNd score quantifies the contribution of a sample to parameter updates: samples with higher GraNd values generally exert stronger influence on the convergence direction of the model and are therefore more valuable during training. The MBS characterizes the balance of contributions between the modalities within a multimodal sample.
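For concreteness, the two difficulty-based scores can be sketched as follows (a NumPy sketch using the standard EL2N definition, the L2 norm of the prediction error, and a commonly used last-layer approximation of GraNd; array shapes and function names are our choices):

```python
import numpy as np

def el2n_score(probs, y, num_classes):
    """EL2N: L2 norm of the prediction error, ||p - onehot(y)||_2.
    Lower = easier sample.  probs: (N, C) softmax outputs."""
    onehot = np.eye(num_classes)[y]
    return np.linalg.norm(probs - onehot, axis=1)

def grand_score_lastlayer(probs, y, feats, num_classes):
    """GraNd, last-layer approximation: Frobenius norm of the
    cross-entropy gradient w.r.t. the classifier weights, which
    factors as ||p - onehot(y)||_2 * ||z||_2.
    Higher = larger influence on parameter updates.
    feats: (N, d) penultimate features z."""
    err = probs - np.eye(num_classes)[y]   # gradient w.r.t. the logits
    return np.linalg.norm(err, axis=1) * np.linalg.norm(feats, axis=1)
```

The factorized form follows because the weight gradient for one sample is the outer product of the logit error and the feature vector.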
Figure 4 and Figure 5 present scatter plots illustrating the evolution of three sample attributes across the CREMA-D and AVE datasets during 30 rounds of pre-training. These attributes become increasingly pronounced as the number of pre-training epochs increases.
Regarding the evolution of EL2N and GraNd, both exhibit a rapid decline in scores during the early training stage (epochs 1–10), indicating that the model quickly learns a subset of “easy samples” and gradually establishes a stratified structure of sample difficulty. At the same time, the GraNd distribution in both datasets demonstrates a long-tail pattern, suggesting that a small number of samples exert significant influence on parameter updates.
However, in the CREMA-D dataset, EL2N scores tend to stabilize in the mid-training stage (after approximately 15 epochs), with the proportion of difficult samples gradually shrinking. The tail of the GraNd distribution also converges, reflecting the limited variation in sample difficulty for emotion recognition tasks. In contrast, the AVE dataset maintains a high degree of dispersion in EL2N scores throughout the entire 30-epoch training process, with difficult samples persisting for longer. Its GraNd distribution exhibits an even longer tail, indicating stronger sample heterogeneity in event recognition tasks and a greater reliance of the model on a small set of critical samples.
The evolution of MBS clearly reveals the modality balance differences between the two datasets. In the CREMA-D dataset, MBS values are generally distributed evenly around 1, with a slight bias toward values greater than 1. This indicates that the audio modality contributes more strongly, though the overall balance remains relatively stable. In contrast, the AVE dataset shows MBS values concentrated within the interval (0, 1), suggesting that the video modality holds absolute dominance in the sample set.
Moreover, the distribution of CREMA-D gradually converges toward values close to 1 in the mid-training stage, implying that the contributions of audio and video modalities become more balanced and cross-modal synergy is better achieved. By comparison, the AVE distribution remains more dispersed throughout the entire 30 rounds of training, with a substantial proportion of samples still deviating from balanced states. This phenomenon indicates that, relative to emotion recognition tasks, event recognition tasks are more complex, with varying degrees of dependence on visual and auditory information across different events.
Further, we plotted the median ratio m_r / m_{r−1} of the three scores (the ratio of each score's median at epoch r to that at epoch r − 1) across the CREMA-D and AVE datasets during 30 rounds of pre-training, as shown in Figure 6. This trend provides a more intuitive illustration of how different sample attributes gradually emerge and stabilize throughout the pre-training process. For the CREMA-D dataset, due to the relative simplicity of the task, all three scores tend to stabilize within the first 10–15 epochs. Among them, MBS converges faster than EL2N and GraNd, indicating that modality balance is more easily established in this dataset. In contrast, the AVE dataset exhibits a longer transition period across all three scores within 30 epochs. The convergence speed of MBS is comparable to that of EL2N and GraNd, reflecting that modality balance in complex event recognition tasks requires more time to manifest.
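The stabilization behavior described here can be sketched as a small helper that computes the consecutive-epoch median ratio m_r / m_{r−1} (the function name is ours; a trace settling near 1 signals convergence of the score distribution):

```python
import numpy as np

def median_ratio_trace(scores_per_epoch):
    """Ratio of consecutive per-epoch score medians, m_r / m_{r-1}.

    scores_per_epoch : list of 1-D arrays, one array of per-sample
    scores (EL2N, GraNd, or MBS) for each pre-training epoch.
    """
    medians = np.array([np.median(s) for s in scores_per_epoch])
    return medians[1:] / medians[:-1]   # values near 1 indicate stabilization
```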
This observation provides important insights for subsequent sample selection strategies and the choice of selection windows: on CREMA-D, the MBS-based selection strategy can take effect at earlier training stages, quickly identifying and retaining modality-balanced samples, thereby improving training efficiency while maintaining performance. On AVE, however, the stability of MBS emerges later, suggesting that the selection window should be appropriately delayed to ensure that the retained samples genuinely reflect modality balance.

5.3. Experiment B: Model Performance Under Different Selection Strategies

To further validate the generalizability and advantages of the MBS-based selection strategy in formal training, we conducted repeated training experiments based on the results of the 30-round pre-training. Specifically, for both the CREMA-D and AVE datasets, we computed sample scores (EL2N, GraNd, and MBS) at the 5th, 10th, 15th, 20th, 25th, and 30th pre-training epochs. Using these scores, we applied nine different selection ratios ρ ∈ {0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5} to construct filtered sample sets, on which multiple rounds of training were performed to ensure stability and statistical significance of the results. In addition, two control groups were introduced: BASE (no selection, retaining all samples) and RANDOM (randomly selecting samples under the same selection ratios). BASE serves as a reference for the upper bound of model performance, representing the best achievable results when no samples are discarded. RANDOM, on the other hand, is used to evaluate the effectiveness of score-driven selection strategies relative to indiscriminate random selection, ensuring that the observed performance improvements indeed stem from informed sample selection rather than mere changes in sample size.

5.3.1. Model Performance on the CREMA-D Dataset

Figure 7 illustrates the final accuracy of the emotion classification model on the CREMA-D dataset under three selection strategies (EL2N, GraNd, and MBS), applied with different pre-training epochs and selection ratios. Overall, as the number of pre-training epochs increases, the performance of the three score-based selection strategies on CREMA-D gradually shifts from fluctuation to stability. Specifically, in the early training stage (e.g., the 5th and 10th epochs), the results of EL2N-based and GraNd-based selection remain unstable, with model accuracy showing only limited improvement over the RANDOM baseline and, in some cases, even degradation under certain selection ratios. This indicates that the corresponding scores have not yet sufficiently converged, making it difficult to effectively distinguish valuable samples. After the 15th epoch, however, the performance of EL2N and GraNd improves and stabilizes, suggesting that their selection effects become more reliable at this stage. In contrast, the MBS-based strategy demonstrates stable advantages at an earlier stage (around the 10th epoch), significantly outperforming RANDOM under low selection ratios. This suggests that modality balance can serve as a reliable selection criterion even in the early stages of training.
Focusing on several key epochs after stabilization (post-15th epoch), the differences among the three strategies under varying selection ratios become clearer. At low ratios (0.1–0.2), MBS consistently achieves the highest accuracy, in some cases approaching the BASE performance, indicating that it can compress the sample set while maintaining model performance. During this stage, GraNd performs slightly better than EL2N but remains inferior to MBS. EL2N maintains reasonable performance at low ratios but declines more rapidly as the ratio increases. At medium-to-high ratios (0.3–0.4), all three strategies experience performance drops, but MBS exhibits the smallest decline, demonstrating stronger robustness. Even at higher ratios (0.4–0.5), where EL2N and GraNd nearly fail (with GraNd only effective when the model is close to convergence), MBS still manages to control performance loss within a relatively small range.

5.3.2. Model Performance on the AVE Dataset

Figure 8 presents the final accuracy of the emotion classification model on the AVE dataset under three selection strategies (EL2N, GraNd, and MBS), applied with different pre-training epochs and selection ratios. Compared with CREMA-D, the accuracy fluctuations on AVE are noticeably smaller across all three strategies. This indicates that event recognition tasks exhibit greater overall stability in performance, but it also implies that differences among strategies require more fine-grained comparison to be revealed.
From the overall trend, as the number of pre-training epochs increases, the performance of all three strategies gradually shifts from slight early fluctuations to stability. Unlike CREMA-D, the early instability of EL2N and GraNd is less pronounced on AVE; however, their improvements under low selection ratios remain limited, and their performance declines more sharply at higher ratios. In contrast, the MBS-based strategy begins to demonstrate smaller performance fluctuations in the mid-training stage (after the 15th epoch). At medium-to-high selection ratios (0.3–0.4), although all three strategies experience performance drops, MBS exhibits the smallest decline, showing stronger robustness.

5.4. Comprehensive Analysis

In the previous experimental results, we have already observed performance differences among the three strategies on the CREMA-D and AVE datasets. To avoid drawing conclusions limited to a single dataset, this section further provides a unified comparison and quantitative analysis of the three strategies from the perspective of the abstract optimization objective, focusing on two dimensions: robustness and efficiency. By constructing robustness scores and efficiency scores, we can more intuitively reveal the strengths and weaknesses of different strategies under cross-task conditions. For strategy μ, the robustness score is calculated as the weighted sum of wins across different epochs, Σ s(ρ) · win_μ with weight s(ρ) = ρ, as shown in Table 3 and Table 4. Similarly, the efficiency score is computed as the weighted sum of wins across different selection ratios, Σ s(R) · win_μ with weight s(R) = R_rev / 10, as shown in Table 5 and Table 6.
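Under our reading of this aggregation, each score reduces to a weighted sum of win indicators; a minimal sketch (the win encoding and the grouping of settings are assumptions based on the description above):

```python
import numpy as np

def weighted_win_score(wins, weights):
    """Weighted win score: sum_k s(k) * win(k).

    wins    : 1 where the strategy achieved the best accuracy at setting k
              (a selection ratio for robustness, an epoch for efficiency),
              0 otherwise
    weights : the weight s(k), e.g. s(rho) = rho or s(R) = R_rev / 10
    """
    return float(np.dot(np.asarray(weights, dtype=float),
                        np.asarray(wins, dtype=float)))
```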
By mapping the performance across different datasets to the abstract optimization objective defined in Equation (9) of Section 3.2, and combining the robustness and efficiency results presented in the tables above, we can conduct a comprehensive analysis of the three score-driven sample selection strategies: EL2N, GraNd, and MBS.
EL2N (μ_E) shows certain limitations in terms of efficiency. As indicated in Table 5 and Table 6, its efficiency scores remain acceptable under low selection ratios, but the advantage quickly diminishes as the ratio increases, suggesting that its efficiency cost C_eff(R) rises significantly at higher ratios. Regarding robustness (Table 3 and Table 4), EL2N exhibits strong sensitivity on both datasets, with performance degrading sharply under high ratios, leading to a higher robustness cost C_rob(μ_E, ρ). Therefore, EL2N is more suitable for scenarios with medium-to-low selection ratios and relatively sufficient resources, but less appropriate for extreme compression or cross-task conditions.
GraNd (μ_G) demonstrates a more stage-dependent efficiency pattern. While its efficiency scores are relatively high at certain ratios (e.g., 0.1–0.2), the overall distribution is scattered, indicating an unstable efficiency cost C_eff(R). In terms of robustness, GraNd shows considerable fluctuations on both CREMA-D and AVE, with performance dropping rapidly at higher ratios, resulting in a higher robustness cost C_rob(μ_G, ρ). Overall, GraNd can capture critical samples under specific conditions but lacks stability across tasks, making it more suitable as a supplementary strategy rather than a primary one.
MBS (μ_M) achieves a more balanced performance across both efficiency and robustness. As shown in Table 5 and Table 6, its efficiency scores remain consistently high across multiple ratios, particularly stable in the 0.25–0.4 range, with a relatively low efficiency cost C_eff(R). Table 3 and Table 4 further demonstrate that MBS achieves the highest robustness scores on CREMA-D, and although its advantage emerges later on AVE, its overall fluctuations are the smallest, yielding the lowest robustness cost C_rob(μ_M, ρ). Taken together, these results suggest that MBS exhibits stronger generalizability under different task conditions.
In summary, unifying the performance of the three strategies under the optimization objective yields the ordering:
|J(μ_M, ρ, R)| < |J(μ_E, ρ, R)| < |J(μ_G, ρ, R)|
It is important to emphasize that the three strategies do not exhibit fully homogeneous performance in terms of efficiency and robustness. MBS ( μ M ) demonstrates overall superiority under cross-task conditions, particularly at moderate selection ratios (0.25–0.4), where it effectively balances efficiency and robustness, making it a relatively stable choice. EL2N ( μ E ) retains practical utility at medium-to-low ratios, but its stability depends on a greater number of pre-training epochs. GraNd ( μ G ), while capable of capturing critical samples in specific scenarios, incurs relatively higher efficiency and robustness costs overall. Therefore, the main advantage of MBS lies in its generalizability and stability, though it is not absolutely optimal under all conditions. EL2N and GraNd can still serve as complementary strategies in particular contexts, providing additional flexibility in score-driven sample selection.

5.5. Stress Testing and Ablation Analysis

To further examine the behavioral differences among various sample selection strategies, we conduct an ablation-style analysis under several extreme and rarely encountered conditions, including very high pruning ratios (0.80–0.95) and an inadequate pre-training phase. Although such settings are not representative of typical deployment scenarios, they provide a stress test that helps reveal the intrinsic robustness and failure modes of each strategy. The two additional experimental figures (Figure 9) visualize the performance trends across pruning ratios of 0.80–0.95 at 15 pre-training epochs.
Across all conditions, EL2N consistently exhibits the weakest performance, with accuracy degrading sharply as the pruning ratio increases. This behavior is closely tied to its underlying metric: by relying solely on prediction error to quantify sample difficulty, EL2N fails to distinguish modality-specific contributions. When a multimodal sample is informative in one modality but weak in another, EL2N often misjudges its importance, leading to the systematic removal of samples that contain valuable cross-modal information. As pruning becomes more aggressive, this misalignment is amplified, ultimately causing EL2N to collapse.
In contrast, the behaviors of GraNd, MBS, and Random selection display clear dataset-dependent patterns. On CREMA-D, where audio and visual modalities are both of high quality and exhibit strong semantic alignment, the natural balance between modalities reduces the need for explicit modality-aware regulation. The dataset’s well-distributed diversity means that even random selection tends to preserve a sufficient number of informative samples. As a result, GraNd, MBS, and Random show a noticeable degree of performance “resonance,” with GraNd benefiting from stable gradients and MBS maintaining robustness without a pronounced advantage. In such a balanced setting, the inherent structure of the dataset diminishes the differences among selection strategies.
However, when shifting to the more challenging AVE dataset—characterized by substantial modality disparities, higher noise levels, and weaker cross-modal alignment—the distinctions among strategies become much more pronounced. GraNd’s gradient-based importance estimation becomes highly sensitive to the dominant or noisier modality, resulting in unstable and sometimes misleading importance scores. Random selection also becomes unreliable, as the probability of selecting noisy or modality-inconsistent samples increases significantly. In contrast, MBS explicitly evaluates the relative contributions of each modality and effectively filters out samples dominated by a single modality or corrupted by noise. Consequently, MBS demonstrates markedly stronger robustness on AVE, maintaining stable performance even under extreme pruning conditions.
Taken together, these extreme-condition experiments form an implicit ablation study. The results indicate that the performance gains of MBS cannot be attributed to sample diversity (otherwise Random would perform comparably), nor to difficulty or gradient magnitude alone (otherwise EL2N/GraNd would consistently dominate). Instead, the key determinant of robustness is the ability to identify and retain samples with balanced modality contributions. The consistent performance of MBS across both datasets underscores the central role of modality balance in achieving stable multimodal sample selection under resource-constrained incremental learning.

6. Conclusions

This study investigates the problem of effectively filtering streaming multimodal samples under resource-constrained incremental learning, and proposes as well as systematically validates a sample selection strategy based on the Modality Balance Score (MBS). We first define and compute MBS to quantify the balance of contributions among modalities during training. Based on this definition, we design an MBS-driven selection strategy and conduct comparative analyses against existing EL2N and GraNd strategies. Experiments on two representative multimodal datasets, CREMA-D and AVE, examine the evolution of scores across different pre-training stages and the performance under varying selection ratios, thereby providing a comprehensive evaluation of the effectiveness and generalizability of MBS.
Extensive experimental results reveal that dataset characteristics directly influence the performance of selection strategies. For datasets such as CREMA-D, where modality balance is relatively strong, the MBS-based strategy can take effect at earlier training stages, quickly identifying and retaining modality-balanced samples, thus improving training efficiency while maintaining performance. In contrast, for datasets such as AVE, which exhibit stronger modality imbalance and higher task complexity, the advantages of MBS emerge later. Nevertheless, under medium-to-high selection ratios, MBS still demonstrates smaller performance fluctuations and stronger robustness, indicating its reliability as a selection criterion in complex scenarios.
In resource-constrained environments where data arrive in a streaming fashion and modality differences are pronounced, the MBS-based selection strategy not only prevents the model from over-relying on a single modality but also maintains performance stability under extreme compression conditions. This makes MBS a more practical and valuable strategy for incremental learning, providing solid support for intelligent systems in complex task scenarios. While MBS does not aim to replace difficulty-based selection, it offers consistent improvements in robustness and stability under modality imbalance, which is not addressed by existing strategies. Future work will explore the applicability of MBS to larger-scale and more complex multimodal datasets, and investigate its integration with methods such as dynamic weight allocation and modality completion to achieve more efficient modality rebalancing. In addition, we plan to deploy and validate this strategy in real-world unmanned aerial vehicle (UAV) swarm systems, assessing its real-time capability and scalability, thereby advancing the practical implementation of multimodal incremental learning.

Author Contributions

Conceptualization, Y.X.; methodology, Y.X.; software, J.L., C.Z. and H.W.; resources, B.C.; writing—review and editing, Y.X.; supervision, B.C.; funding acquisition, B.C. and F.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Natural Science Foundation of China, under Grant 62176122; in part by the Aeronautical Science Foundation of China, under Grant 2023Z073052003.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A

Appendix A.1

Algorithm A1: Sample Selection Strategy based on GraNd
Input:
grand_scores: List[float], GraNd score for each sample
keep_ratio: float, proportion of samples to keep (between 0 and 1)
Output:
selected_indices: List[int], indices of retained samples
Steps:
1:  Get total number of samples
2:  num_samples = len(grand_scores)
3:  Compute number of samples to keep
4:  num_keep = int(num_samples × keep_ratio)
5:  Sort scores in descending order
6:  sorted_indices = argsort(grand_scores, descending = True)
7:  Select top num_keep samples
8:  selected_indices = sorted_indices[:num_keep]
9:  Return selected indices
10: return selected_indices

Appendix A.2

Algorithm A2: Sample Selection Strategy based on EL2N
Input:
el2n_scores: List[float], EL2N score for each sample
keep_ratio: float, proportion of samples to keep (between 0 and 1)
Output:
selected_indices: List[int], indices of retained samples
Steps:
1:  Get total number of samples
2:  num_samples = len(el2n_scores)
3:  Compute number of samples to keep
4:  num_keep = int(num_samples × keep_ratio)
5:  Sort scores in ascending order
6:  sorted_indices = argsort(el2n_scores, descending = False)
7:  Select top num_keep samples
8:  selected_indices = sorted_indices[:num_keep]
9:  Return selected indices
10: return selected_indices
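Algorithms A1 and A2 differ only in sort direction; their shared core can be sketched in NumPy as (function and argument names are our choices):

```python
import numpy as np

def select_topk(scores, keep_ratio, descending):
    """Keep the top keep_ratio fraction of samples by score:
    descending for GraNd (largest influence on updates),
    ascending for EL2N (easiest, most stable samples)."""
    scores = np.asarray(scores, dtype=float)
    num_keep = int(len(scores) * keep_ratio)
    order = np.argsort(-scores if descending else scores)
    return order[:num_keep].tolist()
```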

References

  1. Liu, H.I.; Galindo, M.; Xie, H.; Wong, L.-K.; Shuai, H.-H.; Li, Y.-H.; Cheng, W.-H. Lightweight deep learning for resource-constrained environments: A survey. ACM Comput. Surv. 2024, 56, 1–42. [Google Scholar] [CrossRef]
  2. Shuvo, M.M.H.; Islam, S.K.; Cheng, J.; Morshed, B.I. Efficient acceleration of deep learning inference on resource-constrained edge devices: A review. Proc. IEEE 2022, 111, 42–91. [Google Scholar] [CrossRef]
3. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv 2017, arXiv:1602.07360. [Google Scholar]
  4. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar]
Figure 1. Illustration of a UAV swarm system for multimodal streaming data collection and edge-side incremental learning.
Figure 2. The multimodal feature encoding and fusion pipeline, showing audio MFCC–RNN and visual MobileNet_V2–RNN encoders, followed by cross-modal fusion via fused self-attention and classification.
Figure 3. Overview of the multimodal sample selection process in incremental learning, illustrating early-stage scoring, modality balance evaluation, and selection under resource constraints.
Figure 4. Log-Scaled Scores per Sample Across Epochs on CREMA-D.
Figure 5. Log-Scaled Scores per Sample Across Epochs on AVE.
Figure 6. Median Ratio vs. Epoch on the CREMA-D and AVE Datasets.
Figure 7. Accuracy vs. Pruning Ratio for Different Strategies on CREMA-D.
Figure 8. Accuracy vs. Pruning Ratio for Different Strategies on AVE.
Figure 9. Performance under Extreme Pruning Ratios on CREMA-D and AVE.
Table 1. Final model accuracy in the preliminary test (10 rounds of pre-training + 20 rounds of formal training). In all tables, bold numbers indicate the maximum value in each column.

| Strategy / Ratio | 0.1 | 0.2 | 0.3 |
|---|---|---|---|
| EL2N | 0.8000 | 0.8097 | 0.7870 |
| GraNd | 0.8065 | 0.8076 | 0.8011 |
| MBS | **0.8259** | **0.8141** | **0.8065** |
Table 2. Characteristics of EL2N, GraNd, and MBS.

| Metric | Focus | Sample Attribute Described |
|---|---|---|
| EL2N | Learning Difficulty | Indicates whether a sample is easy for the model to learn, and whether it contains noise or lies near the decision boundary |
| GraNd | Contribution to Parameter Updates | Reflects the extent to which a sample drives the training direction of the model |
| MBS | Modality Balance | Characterizes whether the contributions of different modalities within a multimodal sample are balanced |
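To make the contrast between these metrics concrete, the sketch below computes a standard EL2N score (the L2 norm of the softmax-error vector) next to a hypothetical MBS-style balance score for a two-modality sample. The exact MBS formula is not reproduced in this excerpt, so the weaker-to-stronger confidence ratio used here is an assumption for illustration only.

```python
import numpy as np

def el2n_score(probs, label, num_classes):
    # EL2N: L2 norm of the softmax-error vector (p - one-hot(y)).
    onehot = np.eye(num_classes)[label]
    return float(np.linalg.norm(probs - onehot))

def mbs_score(audio_conf, visual_conf, eps=1e-8):
    # Hypothetical MBS-style balance score (an assumption, not the
    # paper's formula): ratio of the weaker to the stronger modality's
    # confidence in the true class; 1.0 means perfectly balanced.
    return min(audio_conf, visual_conf) / (max(audio_conf, visual_conf) + eps)

probs = np.array([0.7, 0.2, 0.1])        # softmax output for one sample
print(round(el2n_score(probs, 0, 3), 3))  # ~0.374: moderately easy sample
print(round(mbs_score(0.9, 0.3), 3))      # ~0.333: audio-dominated sample
```

Under this reading, EL2N would rank the sample by overall difficulty, while the MBS-style score would flag it as modality-imbalanced despite its low error.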
Table 3. Robustness Scores of Different Strategies on the CREMA-D Dataset.

| Strategy / Epoch | 5th | 10th | 15th | 20th | 25th | 30th |
|---|---|---|---|---|---|---|
| RANDOM | **1.70** | 0.95 | 0.20 | 0.85 | 0.70 | 0 |
| EL2N | 0.55 | 0.625 | 0 | 0.15 | 0.175 | 0.70 |
| GraNd | 0.45 | 0.10 | 0 | 0 | 0 | **1.35** |
| MBS | 0 | **1.025** | **2.50** | **1.70** | **1.825** | 0.65 |
Table 4. Robustness Scores of Different Strategies on the AVE Dataset.

| Strategy / Epoch | 5th | 10th | 15th | 20th | 25th | 30th |
|---|---|---|---|---|---|---|
| RANDOM | 0.8 | 0.40 | 0.2 | 0.25 | 0.25 | 0.25 |
| EL2N | 0.20 | 0.45 | 0.65 | 0.80 | 0.60 | 0.25 |
| GraNd | 0.45 | 0.70 | 0 | 0.375 | 0.20 | **1.55** |
| MBS | **1.25** | **1.15** | **1.85** | **1.275** | **1.65** | 0.65 |
Table 5. Efficiency Scores of Different Strategies on the CREMA-D Dataset.

| Strategy / Ratio | 0.1 | 0.15 | 0.2 | 0.25 | 0.3 | 0.35 | 0.4 | 0.45 | 0.5 |
|---|---|---|---|---|---|---|---|---|---|
| RANDOM | 0 | 0 | **3.75** | 0 | 0 | **6.0** | 4.5 | 0 | 1.0 |
| EL2N | **4.0** | **5.5** | 2.5 | 1.75 | **6.0** | 0 | 0 | 0 | 0 |
| GraNd | 2.5 | 0 | 3.0 | 3.0 | 0 | 0 | 0.5 | 0.5 | 0.5 |
| MBS | **4.0** | 5.0 | 1.25 | **5.75** | 4.5 | 4.5 | **5.5** | **10** | **9.0** |
Table 6. Efficiency Scores of Different Strategies on the AVE Dataset.

| Strategy / Ratio | 0.1 | 0.15 | 0.2 | 0.25 | 0.3 | 0.35 | 0.4 | 0.45 | 0.5 |
|---|---|---|---|---|---|---|---|---|---|
| RANDOM | 2.5 | 0 | 2.0 | **6.0** | 0 | 0 | 4.0 | 0 | 0 |
| EL2N | 0 | 2.0 | **4.0** | 2.5 | 0.33 | 1.5 | 1.5 | 0 | **5.0** |
| GraNd | **4.5** | **7.5** | 3.5 | 0 | 0.33 | 0 | 0.5 | 0.5 | 2.5 |
| MBS | 3.5 | 1.0 | 1.0 | 2.0 | **9.83** | **9.0** | **4.5** | **10.0** | 3.0 |
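The ratio sweeps in the efficiency tables correspond to a simple top-k selection step: at each ratio, a strategy ranks samples by its score and keeps the highest-ranked fraction. The sketch below illustrates this step; the function name is an assumption rather than the paper's API, and the ratio is read as the kept (selection) fraction, following the abstract.

```python
import numpy as np

def select_by_score(scores, ratio):
    # Keep the top `ratio` fraction of samples ranked by score
    # (e.g., MBS balance scores: the most balanced samples first).
    n_keep = max(1, int(len(scores) * ratio))
    order = np.argsort(scores)[::-1]   # indices sorted by descending score
    return np.sort(order[:n_keep])     # selected sample indices, in order

scores = np.array([0.2, 0.9, 0.5, 0.7, 0.1])
print(select_by_score(scores, 0.4))    # the two highest-scoring samples
```

Applying this with the per-sample scores of RANDOM, EL2N, GraNd, or MBS reproduces the storage-constrained selection setting the tables evaluate.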

Share and Cite

MDPI and ACS Style

Xu, Y.; Chen, B.; Hu, F.; Liu, J.; Zhao, C.; Wu, H. MBS: A Modality-Balanced Strategy for Multimodal Sample Selection. Mach. Learn. Knowl. Extr. 2026, 8, 17. https://doi.org/10.3390/make8010017

