1. Introduction
Reinforcement learning agents operating in the physical world receive information through multiple sensors: cameras capture spatial structure, proprioceptive sensors report joint angles, language instructions convey goals, and reward signals provide sparse feedback. These modalities differ vastly in noise level, sampling rate, and information density. A robust multimodal RL system must combine them without allowing noise from one channel to degrade the others.
Recent RL architectures address multimodal fusion in two broad ways.
Monolithic approaches process all modalities through a single shared backbone. DreamerV3 (Google DeepMind, London, UK) [1] feeds observations through a shared encoder–RSSM pipeline with fixed hyperparameters, mastering over 150 tasks spanning Atari to robotic control. π0 [2] tokenizes vision, language, state, and action noise into a 3B-parameter PaliGemma transformer with flow matching for continuous action generation. Genie 2 [3] and DIAMOND [4] apply diffusion to world model prediction and video generation, respectively. These systems rely on model scale to implicitly absorb distributional differences across modalities.
Modular approaches assign a dedicated encoder to each modality and then fuse the outputs. MSDP [5] pretrains a multisensory encoder via masked autoencoding across vision, force, and proprioception and then fuses the embeddings through cross-attention. VTDexManip [6] concatenates CLIP visual features with tactile MLP features for dexterous manipulation, finding that adding sparse touch signals improves success rates by 20%. MIB [7] compresses a joint representation via the information bottleneck principle, filtering task-irrelevant information after fusion. These methods use separate encoders but do not equalize signal quality before fusion: the noisy output of a sparse reward encoder enters the same fusion layer as a well-encoded CLIP embedding.
Both paradigms thus share a common vulnerability: cross-modal noise contamination at the point of fusion. When a high-variance modality (e.g., sparse reward, raw IMU acceleration) is fused with a low-variance one (e.g., CLIP vision embedding), noise from the former can corrupt the latter. This problem is especially acute in sensor-driven domains such as wearable health management, where modalities have radically different noise profiles and temporal scales—physical movement precedes heart-rate elevation by several seconds, and language instructions precede visual confirmation by many steps.
A second gap is the absence of principled spectral-domain fusion. Standard attention computes instantaneous pairwise similarities and cannot naturally represent temporally delayed cross-modal correlations. A third gap is the limited application to healthcare. MedDreamer [8] applied an RSSM to electronic health records, and KANDI [9] used diffusion policies for elderly activity promotion, but no prior work has combined per-modality generative refinement with spectral fusion for wearable-sensor health intervention.
This paper addresses these gaps through a design principle we call filter before mixing: each modality’s representation is independently refined by a dedicated Flow Matching module before cross-modal fusion, which is analogous to the signal processing practice of filtering each channel before mixing into a master bus. The refinement intensity adapts to each modality’s signal quality: weak for already well-encoded signals (e.g., frozen CLIP embeddings) and strong for noisy or sparse signals (e.g., raw reward, IMU data). The refined representations are then fused in the spectral domain via a Fourier Neural Operator (FNO), whose complex-valued weights naturally encode phase-shifted cross-modal correlations. An information-theoretic analysis (Section 3.3) shows that pre-fusion denoising preserves more policy-relevant mutual information than post-fusion alternatives, with a residual gate mechanism ensuring that the refinement is never harmful.
The resulting four-layer modular architecture integrates modality-specific encoding (frozen CLIP with Slot Attention, or IMU foundation encoders for health applications), per-modality Flow Matching, FNO spectral fusion, and DNC episodic memory. The modular design enables domain transfer: the same downstream pipeline (Flow Matching → FNO → DNC → PPO) is shared between MiniGrid (Farama Foundation, https://minigrid.farama.org) navigation and PAMAP2 (Technische Universität Darmstadt, Darmstadt, Germany) health management, with only the modality-specific encoders replaced.
We validate the framework in three domains. In MiniGrid navigation, the system achieves 94.0% success on MultiRoom-N2-S4 (+51 pp over IMPALA (Google DeepMind, London, UK) under identical conditions) and 93.0% on N4-S4, while IMPALA fails entirely on harder configurations (0% on N4-S5 and N6). In Crafter (Danijar Hafner, University of Toronto, Toronto, ON, Canada), a procedurally generated open-world environment with no language instructions, the agent achieves an official score of 16.2 ± 0.8% (10 seeds), exceeding DreamerV3 (14.5%) despite operating without the language modality. In wearable-sensor health management on the PAMAP2 dataset, where no pretrained encoder exists and per-modality denoising is critical, the foundation encoder achieves a 2.4× higher cumulative reward and 16× lower cross-seed variance than a vanilla MLP baseline. This progression, from MiniGrid (four modalities, CLIP available) through Crafter (three modalities, no language) to PAMAP2 (three modalities, no pretrained encoder), illustrates a key insight: the advantage of filter before mixing grows with encoder noise and is robust to missing modalities.
The contributions of this paper are as follows:
We propose the filter-before-mixing design principle for multimodal RL, in which per-modality Flow Matching denoises each representation before spectral-domain fusion, and we provide an information-theoretic justification with an explicit approximation bound (Equation (A11)).
We instantiate this principle in FreamerV1 (Filter-before-mixing dreamer), a modular four-layer architecture (93 M parameters, 0.4 M trainable), and validate it on MiniGrid navigation with controlled baselines (IMPALA under identical conditions, PPO).
We validate across three domains, namely MiniGrid (four modalities), Crafter (three modalities, no language), and PAMAP2 (three modalities, no pretrained encoder), showing that the architecture degrades gracefully with missing modalities and that the filter-before-mixing advantage grows with encoder noise.
Table 1 summarizes how these design choices differentiate FreamerV1 from π0. These are not claims of superiority; they reflect optimization for a different regime: sensor-driven decision making with discrete actions, limited data, and the need for interpretable, modular policies.
3. Proposed Method
This section describes the architecture that instantiates the filter-before-mixing principle.
3.1. Architecture Overview
The architecture consists of four layers (Figure 1):
Layer 1: Modality-Specific Encoding. Each of four input modalities (state, vision, language, reward) is encoded by a dedicated module. Vision uses a frozen CLIP ViT-B/32 (OpenAI, San Francisco, CA, USA) [11] with Slot Attention [12]; language uses a Transformer encoder [22] with Slot Attention; state uses an FNO-based encoder [17]; reward uses a temporal Conv1D encoder. All encoders produce d-dimensional embeddings.
Layer 2: Per-Modality Flow Matching. Each modality embedding is refined by an independent Flow Matching module [23] that learns an optimal transport map from a Gaussian prior to the modality’s data manifold, denoising the representation before cross-modal fusion.
Layer 3: FNO Spectral Fusion. The refined modality embeddings are stacked into a single matrix and fused via spectral-domain convolution [17] (SpectralConv1d), producing a unified representation.
Layer 4: DNC Memory + Policy. The fused representation is augmented with episodic context from a DNC memory module [18] and then fed to a PPO [24] policy head that produces a discrete action.
The total parameter count is 93.05 M, of which 92.68 M (99.6%) are the frozen CLIP encoder. The trainable parameters amount to 0.37 M, enabling efficient learning on small datasets.
3.2. Layer 1: Modality-Specific Encoding
Each modality is encoded by a dedicated module producing a d-dimensional embedding (details in Appendix E).
For vision, a frozen CLIP ViT-B/32 [11] extracts 49 patch features (7 × 7 grid, 768-dim), which are projected to the embedding dimension. Slot Attention [12] decomposes the features into K object-centric slots via iterative competitive assignment. Four numerical stabilization techniques reduce the NaN occurrence rate from 23.5% to 0% (see Section 4.2 for details).
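The competitive assignment at the core of Slot Attention can be sketched in a few lines of NumPy. This is a deliberately simplified illustration, not the encoder used in this work: it omits the learned projections, GRU, and MLP of the original method, the feature dimension and slot count are hypothetical, and the max-subtraction and `eps` terms are generic numerical safeguards rather than the four stabilization techniques evaluated in Section 4.2.

```python
import numpy as np

def slot_attention_step(slots, inputs, eps=1e-8):
    """One simplified Slot Attention iteration (no learned projections, GRU, or MLP)."""
    d = slots.shape[-1]
    logits = inputs @ slots.T / np.sqrt(d)                  # (n_inputs, n_slots)
    logits = logits - logits.max(axis=1, keepdims=True)    # avoid overflow in exp
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)           # softmax over slots: slots compete per input
    weights = attn / (attn.sum(axis=0, keepdims=True) + eps)  # normalize over inputs (weighted mean)
    return weights.T @ inputs, attn                          # updated slots: (n_slots, d)

rng = np.random.default_rng(0)
patches = rng.normal(size=(49, 16))   # 49 patch features; dimension 16 is a hypothetical choice
slots = rng.normal(size=(4, 16))      # K = 4 slots is a hypothetical choice
for _ in range(3):                    # a few refinement iterations
    slots, attn = slot_attention_step(slots, patches)
```

Because the softmax runs over the slot axis, each patch distributes a unit of attention among the slots, which is what drives the object-centric decomposition.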
For language, the mission text is processed by a Transformer encoder and decomposed into slots via Slot Attention. An LLM (Qwen2.5 [25]) parses the mission into structured components for reward shaping (Appendix A).
For state, the MiniGrid observation is flattened and encoded by a 1D FNO block [17].
For reward, the past H-step reward history is encoded by a temporal Conv1D.
Bidirectional multi-head cross-modal attention between visual and language slots enables component-level correspondence (e.g., “green goal” slot ↔ green object slot). The resulting slots are mean-pooled to produce the modality embeddings.
3.3. Layer 2: Per-Modality Flow Matching
For each modality, we apply an independent Flow Matching module to refine the encoded representation before cross-modal fusion. This implements the filter-before-mixing principle: every representation is denoised by its own module before the Fourier Neural Operator fuses the modalities in the spectral domain.
Each module learns a velocity field via Conditional Flow Matching (CFM) [23] and refines the encoded embedding by Euler integration from t = 0 to t = 1 over N steps (see Appendix E for the full formulation).
A learnable residual gate ensures training stability. The gate parameter is initialized so that the gate is nearly closed at the start of training, ensuring that the FM module acts as a near-identity function until its velocity field is sufficiently trained. The refinement intensity can be set independently per modality: weak for already well-encoded signals (e.g., frozen CLIP) and strong for noisy signals (e.g., sparse reward). Rather than tuning the gate initialization manually, we adopt OGM-GE gradient modulation [26] to adaptively scale the gate-parameter gradients (with learning rate amplification) based on each modality’s FM loss convergence rate, improving mean performance by +7 pp and roughly halving seed-to-seed variance (Section 4.3.5).
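The refinement step described above can be illustrated with a small NumPy sketch. Several choices here are assumptions made for illustration only: the gate is written as a sigmoid of a scalar parameter `alpha` blending the original and refined embeddings, integration starts from the encoded embedding itself, and `toy_velocity` stands in for a learned velocity network. The exact formulation is the one given in Appendix E.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def refine(z, velocity, alpha, n_steps=4):
    """Euler-integrate a velocity field from t=0 to t=1, then blend through a residual gate."""
    x = z.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity(x, k * dt)   # explicit Euler step
    gate = sigmoid(alpha)                  # assumed gate form: sigmoid of a learnable scalar
    return z + gate * (x - z)              # gate near 0 -> near-identity; gate near 1 -> fully refined

# Hypothetical velocity field that pulls the embedding toward a fixed target.
target = np.ones(8)

def toy_velocity(x, t):
    return target - x

z = np.zeros(8)
closed = refine(z, toy_velocity, alpha=-10.0)   # gate almost closed
opened = refine(z, toy_velocity, alpha=10.0)    # gate almost open
```

With a strongly negative `alpha` the output stays at the input, which mirrors the near-identity behavior at the start of training; as `alpha` grows, the output moves toward the Euler-integrated result.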
We provide an information-theoretic argument for why applying Flow Matching to each modality independently before fusion is preferable to applying it after fusion (see Appendix L for the full derivation).
Consider the encoded representations of M modalities, each corrupted by modality-specific noise whose variance differs across modalities.
In post-fusion denoising (as in π0), the noisy representations are first fused and then denoised. By the data processing inequality [27], information lost during noisy fusion cannot be recovered, and cross-modal noise propagates through shared projection weights.
In pre-fusion denoising (our approach), a modality-specific denoiser is applied to each representation, and the cleaned representations are then fused. If each FM achieves near-optimal denoising, the fused representation preserves more policy-relevant mutual information than under post-fusion denoising.
In practice, the FM modules are imperfect denoisers with nonzero residual error. Pre-fusion denoising remains beneficial whenever this residual error is smaller than the noise it replaces, a condition much weaker than optimal denoising (see Appendix L for the full derivation). The residual gate further tightens this bound: since the gate stays near zero while the velocity field is untrained, pre-fusion denoising is never worse than no denoising.
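A simplified scalar case makes the argument concrete. The following is an illustrative sketch under strong assumptions and not the derivation of Appendix L: every modality observes the same signal $s$, the noise terms are independent with zero mean, fusion is a plain average, and each denoiser leaves an independent zero-mean residual. The symbols $s$, $\sigma_m$, and $e_m$ are notation introduced here for illustration.

```latex
\begin{aligned}
z_m &= s + \epsilon_m, \qquad \mathbb{E}[\epsilon_m] = 0,\quad \operatorname{Var}(\epsilon_m) = \sigma_m^2, \\
\operatorname{Var}\!\Big(\tfrac{1}{M}\sum_{m=1}^{M} z_m - s\Big) &= \frac{1}{M^2}\sum_{m=1}^{M}\sigma_m^2
  \qquad \text{(fusion without prior denoising)}, \\
\hat z_m &= s + \eta_m, \qquad \mathbb{E}[\eta_m] = 0,\quad \operatorname{Var}(\eta_m) = e_m^2, \\
\operatorname{Var}\!\Big(\tfrac{1}{M}\sum_{m=1}^{M} \hat z_m - s\Big) &= \frac{1}{M^2}\sum_{m=1}^{M} e_m^2
  \qquad \text{(denoise, then fuse)}.
\end{aligned}
```

Under these assumptions the fused error is lower whenever $e_m^2 \le \sigma_m^2$ for every modality, with strict improvement if the inequality is strict for at least one. With $M = 2$ and $(\sigma_1, \sigma_2) = (0.1, 1.0)$, the undenoised fused variance is $(0.01 + 1)/4 \approx 0.25$, almost all of which comes from the noisy modality, so reducing only that modality's noise yields most of the gain.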
The above argument relies on two key assumptions: (i) modality-specific noise is approximately Gaussian, and (ii) the FM denoising error is small relative to the modality’s noise level. We assess these for each domain. In MiniGrid, the frozen CLIP encoder produces near-deterministic embeddings for any given frame, while the sparse reward signal has high variance due to rare success events. The Gaussian assumption holds approximately because the noise arises from the stochastic policy’s action sampling rather than sensor noise. In Crafter, vision noise increases due to procedural generation, and the absence of language removes one modality entirely, but the remaining three satisfy the same conditions. In PAMAP2, raw IMU and heart-rate signals have well-characterized sensor noise profiles that are approximately Gaussian, and no pretrained encoder is available to reduce the noise level, which is precisely the regime where condition (ii) is most easily satisfied and the benefit is the largest. Section 4.4 empirically validates this prediction via a noise injection experiment, confirming that condition (ii) holds in practice.
3.4. Layer 3: FNO Spectral Fusion
The four refined modality embeddings are stacked into a matrix and fused via a Fourier Neural Operator (FNO) [17]. The FNO applies spectral convolution: FFT along the embedding dimension, multiplication by learnable complex-valued weights for the first K Fourier modes, and inverse FFT with a pointwise residual path (see Appendix E for the full formulation). The fused representation is obtained by mean pooling over the modality axis.
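The fusion steps listed above can be sketched in NumPy as follows. The sketch assumes the common SpectralConv1d layout in which the modalities play the role of channels, so the complex weights `W` have shape (modalities in, modalities out, modes) and the pointwise path `P` mixes modalities directly; these shapes, names, and the toy sizes are assumptions rather than the configuration in Appendix E.

```python
import numpy as np

def spectral_fuse(Z, W, P):
    """Z: (M, d) stacked embeddings; W: (M, M, K) complex mode weights; P: (M, M) pointwise mixing."""
    M, d = Z.shape
    K = W.shape[-1]
    Zf = np.fft.rfft(Z, axis=-1)                            # FFT along the embedding dimension
    out_f = np.zeros_like(Zf)                                # modes above K are truncated to zero
    out_f[:, :K] = np.einsum("ik,iok->ok", Zf[:, :K], W)     # per-mode mixing across modalities
    spectral = np.fft.irfft(out_f, n=d, axis=-1)             # back to the embedding domain
    pointwise = P.T @ Z                                      # pointwise residual path
    return (spectral + pointwise).mean(axis=0)               # mean pool over the modality axis

rng = np.random.default_rng(0)
M, d = 4, 16                                                 # hypothetical sizes
Z = rng.normal(size=(M, d))

# Sanity check: identity weights on all modes and no pointwise path reduce to plain averaging.
K_all = d // 2 + 1
W_id = np.zeros((M, M, K_all), dtype=complex)
W_id[np.arange(M), np.arange(M), :] = 1.0
fused_identity = spectral_fuse(Z, W_id, np.zeros((M, M)))
```

The identity case is useful as a unit test: any learned deviation of `W` from the identity then represents frequency-dependent cross-modal mixing.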
The key advantage of spectral-domain fusion over attention-based fusion is the ability to represent phase-shifted cross-modal correlations. When two modalities have a temporal delay in their correlation (e.g., IMU activity precedes heart-rate elevation), this appears as a frequency-dependent phase shift in the Fourier domain. The FNO’s complex-valued weights can directly encode such shifts through the phase of each weight, whereas real-valued attention computes instantaneous inner products and cannot represent phase delays without auxiliary mechanisms.
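The link between delay and phase is the Fourier shift theorem, which can be checked numerically. In the sketch below, the signal, its length, and the delay are arbitrary illustrative values; the point is only that multiplying each Fourier coefficient by a unit-modulus complex weight reproduces a (circularly) delayed copy of the signal.

```python
import numpy as np

n, delay = 64, 5                              # illustrative length and delay (in samples)
rng = np.random.default_rng(0)
x = rng.normal(size=n)                        # stand-in for a leading signal

k = np.fft.fftfreq(n) * n                     # integer frequency indices
phase_weights = np.exp(-2j * np.pi * k * delay / n)   # complex weights with |w| = 1

# Multiplying in the frequency domain applies the delay in the signal domain.
y = np.fft.ifft(np.fft.fft(x) * phase_weights).real
```

Here `y` coincides with `np.roll(x, delay)`, so a purely complex-phase weight is enough to align a lagging signal with a leading one, which a real-valued elementwise weight cannot do.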
It is important to distinguish the role of the FNO here from the FNO-based state encoder in Layer 1. The Layer 1 FNO encodes a single modality (the grid observation) into a d-dimensional embedding—a standard spectral encoding operation. The Layer 3 FNO, by contrast, operates after per-modality Flow Matching and serves a fundamentally different purpose: it acts as a PDE-guided fusion operator over the denoised representations. When different modalities undergo different FM refinement trajectories (because their velocity fields and gate values differ), the resulting representations lie on different regions of the learned manifold. The Layer 3 FNO resolves these heterogeneous trajectories into a single coherent representation by leveraging the spectral structure of the cross-modal correlations—analogous to how an FNO solves a PDE by operating in the frequency domain over spatially heterogeneous initial conditions. This PDE-guidance role becomes increasingly important as the diversity of FM trajectories grows, which occurs when modalities have substantially different noise levels (as in PAMAP2, where raw IMU and heart-rate signals follow very different refinement paths) or when the environment requires the agent to integrate temporally delayed cross-modal signals.
3.5. Layer 4: DNC Memory and PPO Policy
The fused representation is augmented with episodic context via a Differentiable Neural Computer (DNC) [18] with content-based addressing, producing a memory-augmented representation. A categorical PPO [24] policy head then produces discrete actions.
The DNC architecture details (content-based addressing, write/read operations) and PPO objective formulation are provided in Appendix F.
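As a rough illustration of this layer, the sketch below implements a content-based memory read (softmax over scaled cosine similarity, as in the DNC literature) followed by a categorical policy head. The sharpness `beta`, the concatenation of fused and read vectors, the linear head `W_pi`, and all sizes are assumptions for illustration; write operations are omitted, and the actual formulation is in Appendix F.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def content_read(memory, key, beta=5.0, eps=1e-8):
    """Content-based addressing: memory is (N, d), key is (d,)."""
    sim = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps)
    weights = softmax(beta * sim)              # sharper for larger beta
    return weights @ memory, weights           # read vector and addressing weights

def policy_probs(h, read, W_pi):
    """Categorical action distribution from the memory-augmented representation."""
    return softmax(W_pi @ np.concatenate([h, read]))

rng = np.random.default_rng(0)
memory = rng.normal(size=(6, 8))               # 6 slots of dimension 8 (hypothetical)
h = memory[2].copy()                           # query that matches slot 2 exactly
read, weights = content_read(memory, h)
probs = policy_probs(h, read, rng.normal(size=(7, 16)))   # 7 discrete actions (hypothetical)
action = rng.choice(len(probs), p=probs)
```

Querying with a stored row places the largest addressing weight on that row, which is the behavior that lets the policy recall earlier context.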
3.6. Auxiliary Components
The framework includes several auxiliary mechanisms whose details are in Appendix G:
Language-grounded reward shaping provides dense supervision in sparse-reward environments by matching LLM-parsed mission structure against observations [25].
Adaptive reward shaping [28] decays bonus rewards as the success rate increases.
Success Buffer [29] stores and replays successful episodes to prevent catastrophic forgetting.
Score-based adaptive complexity estimation [30] dynamically adjusts the number of FM inference steps per modality.
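Two of the mechanisms above are simple enough to sketch. The linear decay schedule, the buffer capacity, and every name below are hypothetical choices for illustration; the schedules and replay rules actually used are described in Appendix G.

```python
import random
from collections import deque

def shaped_reward(env_reward, bonus, success_rate):
    """Add a shaping bonus that decays linearly as the success rate rises (assumed schedule)."""
    return env_reward + (1.0 - success_rate) * bonus

class SuccessBuffer:
    """Keeps the most recent successful episodes for later replay."""

    def __init__(self, capacity=100):
        self.episodes = deque(maxlen=capacity)   # oldest successes are evicted first

    def add(self, episode, success):
        if success:                              # failed episodes are not stored
            self.episodes.append(episode)

    def sample(self, k):
        pool = list(self.episodes)
        return random.sample(pool, min(k, len(pool)))

buffer = SuccessBuffer(capacity=2)
for name, ok in [("a", True), ("b", False), ("c", True), ("d", True)]:
    buffer.add(name, ok)
replayed = buffer.sample(5)
```

Tying the bonus to the success rate means the shaped signal vanishes once the agent solves the task reliably, leaving only the environment reward.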
3.7. Training Objective
The overall loss function combines four terms: the PPO policy loss, the per-modality Flow Matching loss (Equation (A6)), a vision–language contrastive alignment loss [11], and the FNO smoothness regularization (Equation (A9)).
These four objectives operate at different levels of the architecture: the Flow Matching loss trains the per-modality velocity fields (Layer 2), the FNO regularization governs the cross-modal fusion parameters (Layer 3), the alignment loss aligns vision and language representations (cross-layer), and the PPO loss optimizes the policy (Layer 4).
A potential concern with multi-objective optimization is gradient interference: updates to the FM velocity fields that reduce the Flow Matching loss might increase the policy loss by changing the representation landscape that the policy has adapted to. We mitigate this through two mechanisms. First, each FM module produces its output through a learnable residual gate, which is initialized near zero. During early training, the gate is nearly closed, so the FM modules do not disrupt the representations that the policy is learning from. As training progresses, the gate opens, gradually introducing the FM refinement. This staged introduction ensures that the Flow Matching and policy objectives do not interfere during the critical early phase. Second, the CLIP visual encoder (92.68 M of 93.05 M total parameters) is frozen, eliminating the largest source of potential gradient interference. The FM velocity fields, FNO weights, and policy parameters constitute only 0.37 M trainable parameters, operating in a low-dimensional optimization landscape where multi-objective conflicts are empirically manageable.
Regarding convergence, the conditional Flow Matching loss is a regression loss with a unique global minimum, and Lipman et al. [23] showed that its gradient estimator has bounded variance under the optimal transport conditional path. Combined with PPO’s clipped surrogate objective [24], which bounds policy updates to a trust region, and the Adam optimizer with gradient clipping, the overall training procedure converges reliably in practice.
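The regression character of the Flow Matching loss is easy to see in code. The sketch below uses the straight-line conditional path between a prior sample and a data sample, which corresponds to the optimal transport path with its minimum-variance parameter set to zero; that simplification, the batch shapes, and the constant-target oracle are assumptions for illustration rather than the loss of Equation (A6).

```python
import numpy as np

def cfm_loss(velocity, x0, x1, t):
    """x0: prior samples (B, d); x1: data samples (B, d); t: times (B, 1) in [0, 1)."""
    xt = (1.0 - t) * x0 + t * x1          # point on the straight path from x0 to x1
    target = x1 - x0                      # constant velocity along that path
    pred = velocity(xt, t)
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))

rng = np.random.default_rng(0)
c = np.full(4, 2.0)                       # toy data distribution: a single point
x0 = rng.normal(size=(32, 4))             # Gaussian prior samples
x1 = np.tile(c, (32, 1))
t = rng.uniform(0.0, 0.9, size=(32, 1))

def oracle_velocity(xt, t):
    """Exact velocity for the single-point target: recovers x1 - x0 from (xt, t)."""
    return (c - xt) / (1.0 - t)

def zero_velocity(xt, t):
    return np.zeros_like(xt)

loss_oracle = cfm_loss(oracle_velocity, x0, x1, t)
loss_zero = cfm_loss(zero_velocity, x0, x1, t)
```

The oracle attains zero loss while an untrained (zero) field does not, which illustrates that the objective is an ordinary least-squares fit to a known target.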
The full PPO objective is given in Appendix F.
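For reference, the standard clipped PPO objective of Schulman et al. [24] has the form below. This is the generic textbook formulation, not necessarily the exact variant used in this work: the coefficient symbols $c_1$ and $c_2$, the clipping range $\epsilon$, and the advantage estimate $\hat{A}_t$ follow the original PPO notation, and their values here are unspecified.

```latex
\begin{aligned}
r_t(\theta) &= \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}, \\
L^{\mathrm{CLIP}}(\theta) &= \hat{\mathbb{E}}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\big)\right], \\
L^{\mathrm{PPO}}(\theta) &= \hat{\mathbb{E}}_t\!\left[L^{\mathrm{CLIP}}_t(\theta) - c_1\,L^{\mathrm{VF}}_t(\theta) + c_2\,S[\pi_\theta](s_t)\right].
\end{aligned}
```

Here $L^{\mathrm{VF}}$ is the squared-error value loss and $S$ is the policy entropy. The expression is maximized, so the training loss is its negative.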
4. Experiments
We evaluate the proposed architecture in three complementary settings. First, we verify that the Slot Attention stabilization techniques are effective on the MSR-VTT video–text dataset, as numerical stability is a prerequisite for the downstream Flow Matching and FNO layers. Second, we experiment on MiniGrid navigation tasks to assess the overall performance of the integrated system and to quantify the contribution of each component through ablation. Third, we apply the architecture to wearable-sensor health management on the PAMAP2 dataset (Section 5) to validate whether the filter-before-mixing principle produces better world models than flat-concatenation or attention-only encoders.
4.1. Experimental Setup
4.1.1. MSR-VTT Experiments
We used the MSR-VTT dataset [31] comprising 7010 videos with 20 captions each (90%/10% train/validation split) to evaluate the numerical stability of Slot Attention under heterogeneous multimodal inputs. Images were resized to 224 × 224 for the CLIP encoder.
4.1.2. MiniGrid Experiments
We used MiniGrid [32], a 2D grid-world environment for goal-oriented navigation and instruction-following tasks [33]. Experiments were conducted across several configurations: Empty, DoorKey, and MultiRoom (N2–N4, S4–S5). Each configuration was trained with a fixed random seed; IMPALA baselines used 3 seeds with standard deviation reported. Multi-seed evaluation of the proposed method with error bars is reported for the full architecture experiments (Section 4.3.5).
Table 3 shows the model configuration. The LLM mission parser uses Qwen2.5-3B-Instruct (Alibaba Cloud, Hangzhou, China) with a regex-based fallback.
Table 4 shows that the trainable parameters constitute only 0.4% of the total with the frozen CLIP encoder accounting for 99.6%.
4.2. Slot Attention Stability
Table 5 shows that the four stabilization techniques (introduced in Section 3.2 for the vision encoder) progressively eliminate NaN occurrences. The full set reduces the NaN rate from 23.5% to 0%, which is a prerequisite for the downstream Flow Matching and FNO fusion layers: if Slot Attention produces NaN, the entire pipeline collapses.
Table 6 confirms that the stabilized CLIP + Slot Attention encoder achieves the lowest cross-modal alignment loss, validating that the stabilization does not compromise representation quality.
4.3. MiniGrid Results
4.3.1. Overall Performance
Table 7 summarizes results across MiniGrid configurations. The framework achieves its strongest results on MultiRoom-N4-S4 (93.0% success) and N2-S4 (94.0%), demonstrating effective navigation through up to 4 interconnected rooms.
Two patterns are notable. First, room size 4 is consistently easier than size 5 (N4-S4: 93% vs. N4-S5: 36%), suggesting that the language reward shaping (“traverse rooms to reach the goal”) provides sufficient guidance when each room is small enough for the agent to observe the door and goal simultaneously. The N4-S5 from-scratch experiment (0% at 2000 episodes) confirms that curriculum learning is essential for complex multi-room configurations: the 35.7% achieved with curriculum initialization from N2-S4 cannot be reached by training from scratch within the same budget. Second, DoorKey-8 × 8 plateaus at 20%, indicating that the current framework struggles with tasks requiring key acquisition followed by door opening—a two-stage compositional skill that may benefit from hierarchical planning not included in the current architecture.
4.3.2. Comparison with PPO Baseline
Table 8 shows that the integrated framework improves over standard PPO by +44.6 pp on N2-S4 and +31.4 pp on N3-S4. The PPO baseline uses FlatObsWrapper (symbolic observations), whereas our method operates on pixel-level visual inputs processed through CLIP, making the comparison conservative: our method achieves higher success rates despite receiving a harder input modality.
4.3.3. Ablation Study
Table 9 quantifies the contribution of each component by removing one at a time from the full system. Three observations connect directly to the design rationale. Removing the frozen CLIP encoder (−19.8 pp) causes the largest performance drop, confirming that internet-scale pretrained visual representations are the most critical component; the CLIP encoder brings knowledge that cannot be learned from MiniGrid data alone (Section 3). Removing language reward shaping (−8.0 pp) degrades performance substantially, confirming that the LLM-parsed mission structure provides essential dense supervision in the sparse-reward MultiRoom environment. Replacing PPO with SAC (−34.9 pp) yields near-zero success, confirming that categorical PPO is appropriate for discrete action spaces.
4.3.4. Literature-Based Positioning
Table 10 contextualizes our results against other methods reported in the literature. These comparisons are indicative, not definitive, as experimental conditions differ across papers.
DreamerV3, despite its success across 150+ domains including discrete-action Atari, converges to suboptimal policies on MiniGrid [36], where IMPALA outperforms it. To strengthen our comparison, we reproduced IMPALA under identical conditions (FlatObsWrapper, 2000 episodes, MLP with two 256-unit hidden layers, 3 seeds). IMPALA achieved 43.0 ± 7.8% on N2-S4, below the PPO reference value of 65%, and failed entirely on N4-S5 and N6 (0% across all seeds), indicating that model-free methods without external memory or planning mechanisms struggle to solve multi-room navigation beyond the simplest configuration within this budget. FreamerV1’s 94% success rate on N2-S4 (+51 percentage points over IMPALA) and 93% on N4-S4 demonstrate the effectiveness of the integrated architecture.
4.3.5. Full Architecture Evaluation
To isolate the contribution of the filter-before-mixing pipeline (per-modality FM, FNO spectral fusion, DNC memory) from the modality-specific encoder (Layer 1), we evaluate multiple configurations on MultiRoom-N2-S4.
To provide a controlled comparison with the PPO baseline in Section 4.3.2, the Layer 1-only baseline in Table 11 uses the identical CLIP encoder, observation pipeline, reward structure, and PPO hyperparameters as FreamerV1, differing only in the absence of the filter-before-mixing pipeline. This controlled baseline achieves 78% at 5000 episodes compared to 94.7 ± 4.5% for FreamerV1 + OGM-GE under exactly the same training budget and evaluation protocol.
Table 11 and Figure 2 reveal two important findings. First, FreamerV1 converges more slowly than Layer 1 alone (79% vs. 94% at 2000 episodes) due to the additional 1.4M parameters in the filter-before-mixing pipeline. However, with continued training, FreamerV1 reaches 87.7 ± 8.2% at 5000 episodes (one seed reaching 100%), surpassing the Layer 1-only baseline. Second, Layer 1 alone degrades from 94% to 78% with continued training, a clear instance of catastrophic forgetting visible in the learning curve (Figure 2, red line). The filter-before-mixing pipeline prevents this degradation: per-modality FM stabilizes the learned representations by enforcing modality-specific structure, while the DNC provides episodic recall that anchors the policy against distributional shift in the replay buffer.
The removal of DNC has minimal impact at 2000 episodes (80% vs. 79%), which is consistent with the observation that N2-S4 rooms are visited sequentially and episodic memory provides limited benefit at short training horizons.
The manual per-modality gate initialization (vision, state, language, reward) requires domain knowledge and produces high seed-to-seed variance (80–99%). To address this, we adopt OGM-GE (On-the-fly Gradient Modulation with Gradient Enhancement) [26], which dynamically modulates the gate-parameter gradients based on per-modality FM loss progress. Modalities with slower-converging FM receive enhanced gradients (gate opens faster), while faster-converging modalities are suppressed (gate stays closed). We apply OGM-GE to the gate parameter only (Mode A) with learning rate amplification, rather than to all FM parameters (Mode B), as the latter destabilizes the velocity field and degrades performance (Appendix K). With OGM-GE Mode A, FreamerV1 achieves 94.7 ± 4.5% (3 seeds), improving the mean (+7 pp over manual initialization) and reducing seed-to-seed variance (±4.5% vs. ±8.2%), which removes the need for per-modality hyperparameter tuning.
4.3.6. Encoder-Ablated Component Analysis
To isolate the contribution of each component without the confounding effect of the strong CLIP encoder, we repeat the ablation with a randomly initialized CNN encoder (Table 12).
Three findings emerge. First, the full pipeline (FM + FNO + DNC) improves over the CNN-only baseline by +9 pp (55% vs. 46%), confirming that the filter-before-mixing principle contributes independently of the encoder quality. Second, FNO alone degrades performance below the baseline (29% vs. 46%), which is in stark contrast to the CLIP setting where FNO alone reaches 97%. This demonstrates that the Layer 3 FNO requires denoised input from Layer 2 FM to function effectively: when fed noisy CNN embeddings directly, the FNO’s spectral convolution amplifies rather than resolves cross-modal noise. Third, comparing the CLIP and CNN settings reveals the PDE-guidance role of the post-FM FNO (Section 3.4): when FM trajectories are nearly identical (CLIP, low noise), the FNO’s fusion role is sufficient; when FM trajectories diverge substantially (CNN, high noise), the FNO must additionally resolve heterogeneous refinement paths, and this resolution fails without prior denoising.
4.4. Noise Robustness: Cross-Modal Contamination
To directly test the filter-before-mixing claim that per-modality FM prevents cross-modal noise contamination, we inject Gaussian noise into a single modality’s embedding at evaluation time and measure the effect on policy success rate (Table 13).
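The injection protocol can be sketched as a small helper that perturbs exactly one modality and leaves the others untouched. The dictionary layout, modality names, embedding size, and noise scale below are hypothetical; the noise levels actually used are those reported in Table 13.

```python
import numpy as np

def inject_noise(embeddings, modality, sigma, rng):
    """Return a copy of the embeddings with Gaussian noise added to one modality only."""
    noisy = {name: z.copy() for name, z in embeddings.items()}   # never mutate the inputs
    noise = rng.normal(scale=sigma, size=noisy[modality].shape)
    noisy[modality] = noisy[modality] + noise
    return noisy

rng = np.random.default_rng(0)
embeddings = {"vision": np.ones(8), "reward": np.zeros(8)}        # toy embeddings
noisy = inject_noise(embeddings, "reward", sigma=0.5, rng=rng)
```

Evaluating the policy on `noisy` for each modality and noise level in turn then isolates how far a perturbation in one channel spreads through the fusion layer.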
FreamerV1 maintains 94–100% success across all noise levels and modalities, demonstrating that per-modality FM effectively absorbs injected noise before it reaches the FNO fusion layer. In contrast, No-FM achieves only 28–50% even at the lowest noise level and shows no systematic degradation with increasing noise: the monolithic fusion has already conflated modality-specific noise during training, making additional injection indistinguishable.
This result provides direct empirical evidence for the information-theoretic argument of Section 3.3: pre-fusion denoising preserves policy-relevant information that post-fusion approaches irreversibly lose. The per-modality residual gate mechanism (Equation (3)) ensures that noise in one modality cannot propagate to others through shared fusion weights.
Figure A4 provides visual confirmation: under noise injection, the reward modality (gate = 0.26) shows the largest FM displacement (1.085) and variance reduction, while the vision modality (gate = 0.007) remains virtually unchanged, confirming modality-adaptive denoising.
4.5. Transfer Learning
For the N3-S4 environment, a model pretrained on N2-S4 was used as initialization, which was followed by continued training. This improved the success rate from 28.0% (training from scratch) to 46.0% (transfer + fine-tuning), demonstrating that the modular architecture learns representations that transfer across environments of different complexity.
4.6. Crafter Experiments
To evaluate the architecture beyond MiniGrid, we apply it to Crafter [38], a procedurally generated open-world environment with 22 hierarchically structured achievements (e.g., collect wood → place table → make pickaxe → collect stone). Crafter differs from MiniGrid in two critical ways: (1) it provides no language instructions, so the language modality line receives dummy input and language reward shaping is unavailable; (2) the observation is an RGB image requiring visual understanding of a complex, procedurally generated world.
The architecture uses three active modality lines (state: 16-dimensional inventory, vision: CLIP-encoded image, reward) with the language line disabled. Training runs on a single GPU (RTX 4090 (NVIDIA, Santa Clara, CA, USA), ∼12 h).
Table 14 summarizes the results. The agent unlocks 16–17 of 22 achievements within the training budget, including intermediate crafting chains (place table, make wood pickaxe, collect stone, place furnace, make stone sword). The average per-episode achievement count of 4.5 indicates that the agent consistently executes multi-step plans rather than merely achieving each item once by chance.
Using the official Crafter score (the geometric mean of per-achievement success rates [38]), FreamerV1 achieves 16.2 ± 0.8% (10 seeds, with achievement reward shaping) compared to DreamerV3’s 14.5% and PPO’s 4.6%. The +1.7 pp improvement over DreamerV3 is consistent across all 10 seeds (minimum 14.8%, maximum 17.6%), and FreamerV1 achieves this without any language modality: the language line receives dummy input and no language reward shaping is applied. This indicates that per-modality encoding with FNO spectral fusion provides competitive performance even when a primary modality is absent, and it suggests that adding language instructions (e.g., achievement descriptions as mission text) could further improve exploration efficiency. We note that EMERALD, a masked latent transformer-based world model, achieves 58.1% on Crafter but requires a larger training budget. Per-seed results are provided in Appendix H.
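For readers unfamiliar with the metric, the Crafter score aggregates per-achievement success rates (in percent) with a geometric mean in log(1 + s) space, which rewards unlocking rare achievements more than raising already-high rates. A minimal sketch follows; the function name and the example inputs are illustrative and are not results from this paper.

```python
import numpy as np

def crafter_score(success_rates_percent):
    """Crafter score: exp(mean(ln(1 + s_i))) - 1, with each s_i in [0, 100]."""
    s = np.asarray(success_rates_percent, dtype=float)
    return float(np.exp(np.mean(np.log1p(s))) - 1.0)

# Illustrative inputs only.
all_zero = crafter_score([0.0] * 22)
all_full = crafter_score([100.0] * 22)
lopsided = crafter_score([100.0, 0.0])
```

The lopsided case yields about 9.05, far below the arithmetic mean of 50, which shows how strongly the score penalizes achievements that are never unlocked.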
7. Conclusions
This paper proposed a design principle for multimodal reinforcement learning, filter before mixing, in which each modality’s representation is denoised by a dedicated Flow Matching module before cross-modal fusion via a Fourier Neural Operator in the spectral domain. This principle addresses the problem of cross-modal noise contamination that arises when heterogeneous modalities with different noise profiles are fused in a shared backbone, which is standard practice in monolithic VLA architectures such as π0.
We instantiated this principle in FreamerV1, a four-layer modular architecture integrating a frozen CLIP encoder with Slot Attention, per-modality Flow Matching, FNO spectral guidance, DNC episodic memory, and LLM-based language reward shaping. The framework achieves competitive performance with 0.37M trainable parameters, nearly four orders of magnitude fewer than π0’s 3B, by exploiting the modular structure to freeze pretrained components and train only the integration layers.
Experiments in three domains validated the approach. On MiniGrid navigation tasks, the full architecture (FM + FNO + DNC) reached 100% success on MultiRoom-N2-S4 at 5000 episodes, surpassing the 94% ceiling of the Layer 1-only baseline, though at the cost of slower initial convergence. On Crafter, an open-world environment without language instructions, the agent slightly surpassed DreamerV3’s official score (16.2 ± 0.8% vs. 14.5%; 10 seeds) using only three of four modality lines, demonstrating graceful degradation with missing modalities. On the PAMAP2 wearable-sensor health management task, the foundation encoder with per-modality Flow Matching, FNO fusion, and DNC memory achieved 2.4× higher cumulative reward ( vs. ), 27% fewer safety violations, and 16× lower cross-seed variance compared to a vanilla MLP-based RSSM world model. The dissociation between reconstruction accuracy and policy quality—the MLP encoder reconstructs observations better but produces worse policies—provides empirical support for the information-theoretic argument that pre-fusion denoising preserves policy-relevant information.
The health management application demonstrates that world model-based RL, which has seen remarkable success in games and robotics, can be extended to wearable-sensor health intervention with minimal architectural modification. The RSSM world model enables imagination-based policy optimization in the latent space, avoiding the ethical and practical difficulties of exposing real patients to dangerous physiological states during training. KL-divergence-based anomaly detection provides an additional safety layer for real-time physiological monitoring.
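The KL-divergence-based anomaly check mentioned above can be illustrated with a minimal sketch. It assumes that the world model's prior and posterior over the latent state are diagonal Gaussians, which may not match the latent distribution actually used in FreamerV1; the function names and the threshold value are hypothetical.

```python
import math


def gaussian_kl(mu_q, std_q, mu_p, std_p):
    """KL(q || p) between two diagonal Gaussians given as equal-length lists."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, std_q, mu_p, std_p):
        total += (
            math.log(sp / sq)
            + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2)
            - 0.5
        )
    return total


def is_anomalous(kl_value, threshold=3.0):
    """Flag a time step whose posterior departs strongly from the prior.

    The default threshold is an arbitrary placeholder, not a value from the paper.
    """
    return kl_value > threshold
```

A large divergence means the observed sensor readings moved the posterior far from what the model predicted, so the step is surprising under the learned dynamics. In practice the threshold would have to be calibrated on held-out data from normal activity.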
The adoption of OGM-GE for adaptive per-modality gate control (Mode A: alpha-only, lr amplification) improves both mean performance (+7 pp) and reproducibility (±4.5% vs. ±8.2%), demonstrating that the filter-before-mixing principle benefits from data-driven modality balancing while requiring careful restriction of gradient modulation to gate parameters only.
Several directions remain for future work. First, controlled comparisons against DreamerV3 on MiniGrid under identical conditions would further strengthen the experimental claims; preliminary experiments with the NM512 PyTorch reimplementation are ongoing. Second, extension to continuous-control robotic tasks and 3D environments would test the generality of the filter-before-mixing principle beyond discrete action spaces. Third, clinical validation of the health management application with domain experts and real patient outcomes is essential before deployment. Fourth, the modest contribution of the FNO and DNC components in MiniGrid, where temporal delays between modalities are limited, motivates evaluation in domains with richer cross-modal dynamics, where the phase-shift capabilities of spectral-domain fusion are expected to be more fully realized. Finally, a direct experimental comparison of pre-fusion vs. post-fusion denoising (e.g., applying FM after FNO fusion rather than before) would test the information-theoretic argument for the filter-before-mixing principle more directly.
Limitations
We acknowledge several limitations. For MiniGrid baselines, we conducted a controlled comparison against IMPALA under identical conditions, confirming the substantial performance gap (Table 10). The Layer 1-only baseline (Table 11) provides a controlled PPO comparison under identical training conditions, using the same encoder, reward structure, and training budget, isolating the contribution of the filter-before-mixing pipeline. The SB3 Zoo PPO values in Table 8 use FlatObsWrapper and may differ in hyperparameters; a fully controlled DreamerV3 comparison remains for future work.
Regarding environment scope, MiniGrid is a 2D grid-world with discrete observations, and transfer to 3D environments, continuous-control tasks, and real robotic systems is unvalidated.
The ablation study removes one component at a time, so pairwise interaction effects (e.g., whether FM helps more or less when DNC is present) are not characterized.
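The missing quantity can be stated concretely. In a 2×2 factorial design over FM and DNC, the pairwise interaction is the difference between the gain from FM when DNC is present and the gain from FM when DNC is absent. The sketch below assumes four scalar metrics (for example, success rates) from four separately trained configurations; the function and argument names are hypothetical, and no such measurements are reported in this paper.

```python
def interaction_effect(neither, fm_only, dnc_only, both):
    """Difference-in-differences interaction for a 2x2 ablation.

    Positive: FM helps more when DNC is present (synergy).
    Negative: FM helps less when DNC is present (redundancy).
    Zero: the two contributions are additive.
    """
    fm_gain_without_dnc = fm_only - neither
    fm_gain_with_dnc = both - dnc_only
    return fm_gain_with_dnc - fm_gain_without_dnc
```

A leave-one-out ablation measures only `both - dnc_only` and `both - fm_only`, so the `neither` configuration needed to compute this contrast is unavailable from such a study.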
For statistical reporting, the initial MiniGrid experiments (Table 7, Table 8, and Table 9) and the full architecture evaluation (Table 11) report single-seed results; multi-seed evaluation is ongoing. The IMPALA comparison uses three seeds with standard deviations, and the PAMAP2 encoder comparison (Table 15) uses three seeds.
The health management demonstration uses simulated rewards based on physiological heuristics, and clinical validation with domain experts is essential before deployment.
Finally, the FNO guidance layer and DNC memory show modest contributions in MiniGrid (Table 9), where the environment lacks strong temporal delays and long-horizon dependencies; their full potential is hypothesized to emerge in more complex domains, which remains to be validated.