Robust Offline Reinforcement Learning Through Causal Feature Disentanglement
Abstract
1. Introduction
- Learnable Feature Disentanglement Mechanism: We integrate the CausalVAE framework to disentangle corrupted data into causal features and non-causal features (those vulnerable to interference), with formal identifiability guarantees derived from structured causal models under idealized assumptions. Theoretically, this disentanglement confers a robustness advantage under data corruption;
- Causality-Preserving Perturbation Training: We generate counterfactual samples by applying Gaussian perturbations to the non-causal features, and design a dual-path feature-alignment loss and a contrastive loss to enforce perturbation-invariant causal representations (see the sketch after this list);
- Dynamic Graph Diagnosis and Reconstruction: We employ a graph convolution-attention network to model spatiotemporal causal relationships among state variables and identify corruption-sensitive edges through a graph-structure consistency loss, enabling precise repair of corrupted data and smooth recovery of the dynamics.
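To make the second contribution concrete, the following minimal PyTorch sketch illustrates the causality-preserving perturbation idea: only the non-causal latent block receives a Gaussian intervention, and a dual-path alignment loss plus an InfoNCE-style contrastive loss encourage the downstream representation to stay invariant. The split into `z_causal`/`z_noncausal`, the `policy_head` module, and the InfoNCE form of the contrastive term are illustrative assumptions, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def causality_preserving_losses(z_causal, z_noncausal, policy_head,
                                noise_scale=0.1, temperature=0.5):
    """Perturb only the non-causal latents (a counterfactual-style Gaussian
    intervention) and penalize any change in the downstream representation."""
    # Counterfactual intervention: Gaussian noise on the non-causal block only.
    z_noncausal_tilde = z_noncausal + noise_scale * torch.randn_like(z_noncausal)

    # Dual paths: features computed from original vs. intervened latents.
    h_orig = policy_head(torch.cat([z_causal, z_noncausal], dim=-1))
    h_pert = policy_head(torch.cat([z_causal, z_noncausal_tilde], dim=-1))

    # Feature-alignment loss: the two paths should agree.
    align_loss = F.mse_loss(h_pert, h_orig.detach())

    # Contrastive loss (InfoNCE-style): each perturbed sample should be most
    # similar to its own unperturbed counterpart within the batch.
    h_o = F.normalize(h_orig, dim=-1)
    h_p = F.normalize(h_pert, dim=-1)
    logits = h_p @ h_o.t() / temperature                  # (B, B) similarities
    labels = torch.arange(h_o.size(0), device=h_o.device)
    contrastive_loss = F.cross_entropy(logits, labels)

    return align_loss, contrastive_loss

if __name__ == "__main__":
    torch.manual_seed(0)
    B, d_c, d_n, d_h = 8, 6, 6, 32
    head = torch.nn.Linear(d_c + d_n, d_h)                # stand-in feature head
    zc, zn = torch.randn(B, d_c), torch.randn(B, d_n)
    print(causality_preserving_losses(zc, zn, head))
```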
2. Related Works
2.1. Offline Reinforcement Learning
2.2. Data Corruption Reinforcement Learning
2.3. Disentangled Reinforcement Learning
3. Preliminaries
3.1. Offline Reinforcement Learning
3.2. Causal Inference
3.3. Data Corruption
4. Methods
4.1. Causal Feature Disentanglement
4.2. Causality-Preserving Perturbation Mechanism
4.3. Dynamic Causal Graph Diagnosis and Reconstruction
4.4. Multi-Objective Optimization
5. Experiment
5.1. Environment Setting
5.2. Evaluation Under Random Corruption
5.3. Evaluation Under Adversarial Corruption
5.4. Ablation Experiments
5.4.1. Ablation Study on Different Modules
5.4.2. Ablation Study on Corruption Rates
5.4.3. Ablation Study on Hyperparameters
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
RCFD | Robust Causal Feature Disentanglement |
RL | Reinforcement Learning |
GCAN | Graph Convolution-Attention Hybrid Network |
DAG | Directed Acyclic Graph |
OOD | Out of Distribution |
DT | Decision Transformer |
MDP | Markov Decision Process |
SCM | Structured Causal Model |
Appendix A. Proof of Theorem
Appendix A.1. Variational Lower Bound (ELBO) Derivation
Appendix A.2. Markovian Assumption
Appendix A.3. KL Divergence Decomposition
Appendix A.4. Detailed Derivation of Theoretical Analysis for Causal Disentanglement Robustness
Appendix A.4.1. Distributional Shift Analysis Under Corruption
Appendix A.4.2. Bound Analysis of Policy Performance Loss
Appendix A.4.3. Quantitative Analysis of Robustness Gains
Appendix A.4.4. Asymptotic Robustness Guarantee
Appendix A.5. Identifiability Constraints
Appendix B. Environment Setting
Appendix B.1. Data Corruption Details
- Random observation attack: We randomly sample a fraction $c$ (the corruption rate) of state transition tuples from the dataset and apply Gaussian noise perturbations to the selected states, $\hat{s} = s + \lambda \odot \mathrm{std}(s)$ with $\lambda \sim \mathcal{N}(0, \epsilon^2 I_{d_s})$, where $d_s$ is the state dimensionality, $\mathrm{std}(s)$ denotes the dimension-wise standard deviation of all states in the dataset, and the corruption scale $\epsilon$ controls the corruption intensity (a code sketch of the random corruption follows this list).
- Random action attack: Using the same sampling rate $c$, we select state transition samples and perturb the actions, $\hat{a} = a + \lambda \odot \mathrm{std}(a)$ with $\lambda \sim \mathcal{N}(0, \epsilon^2 I_{d_a})$, where $d_a$ is the action dimensionality and $\mathrm{std}(a)$ denotes the dimension-wise standard deviation of the action space.
- Random reward attack: For randomly selected samples, we replace the original rewards with corrupted rewards whose magnitude is amplified well beyond the corruption scale $\epsilon$. The amplification factor is employed because offline reinforcement learning algorithms exhibit natural resistance to small-scale reward perturbations but suffer performance collapse under large-scale corruption.
- Adversarial observation attack: We first pre-train an EDAC agent on clean data to obtain the Q-network $Q$ and the policy $\pi$. Subsequently, we perform gradient-based optimization attacks on a selected proportion of states, $\hat{s} = \arg\min_{\hat{s} \in \mathbb{B}_d(s,\epsilon)} Q(\hat{s}, \pi(\hat{s}))$, where $\mathbb{B}_d(s,\epsilon) = \{\hat{s} : |\hat{s} - s| \le \epsilon \cdot \mathrm{std}(s)\}$ limits the maximum deviation of each state dimension. The optimization employs Projected Gradient Descent with 100 iterations and a learning rate of 0.01 and clips the perturbation vector to the range $[-\epsilon \cdot \mathrm{std}(s), \epsilon \cdot \mathrm{std}(s)]$ after each update.
- Adversarial action attack: Utilizing the pre-trained agent, we apply analogous attacks to the actions, $\hat{a} = \arg\min_{\hat{a} \in \mathbb{B}_d(a,\epsilon)} Q(s, \hat{a})$, where $\mathbb{B}_d(a,\epsilon) = \{\hat{a} : |\hat{a} - a| \le \epsilon \cdot \mathrm{std}(a)\}$, with the optimization strategy consistent with the state attacks.
- Adversarial reward attack: For reward signals, we adopt a direct inversion strategy, $\hat{r} = -\epsilon \times r$. This method exploits the optimal-solution properties of the corresponding reward-minimization objective.
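For concreteness, the following NumPy sketch shows the random observation/action corruption (Gaussian noise scaled by the dimension-wise dataset standard deviation) and the reward-inversion attack described above. The D4RL-style dictionary keys and function names are illustrative assumptions, and the PGD-based adversarial attacks are omitted.

```python
import numpy as np

def random_corrupt(array, rate=0.3, scale=1.0, seed=0):
    """Corrupt a random fraction `rate` of rows in an (N, d) array by adding
    Gaussian noise scaled by the dimension-wise standard deviation."""
    rng = np.random.default_rng(seed)
    data = array.astype(float).copy()
    n, d = data.shape
    std = data.std(axis=0)                                   # per-dimension std
    idx = rng.choice(n, size=int(rate * n), replace=False)   # corrupted indices
    data[idx] += scale * rng.normal(size=(len(idx), d)) * std
    return data, idx

def invert_rewards(rewards, rate=0.3, scale=1.0, seed=0):
    """Adversarial reward attack via direct inversion of selected rewards."""
    rng = np.random.default_rng(seed)
    r = rewards.astype(float)
    idx = rng.choice(len(r), size=int(rate * len(r)), replace=False)
    r[idx] = -scale * r[idx]
    return r, idx

if __name__ == "__main__":
    dataset = {"observations": np.random.randn(1000, 17),
               "rewards": np.random.rand(1000)}
    obs_hat, _ = random_corrupt(dataset["observations"], rate=0.3, scale=1.0)
    rew_hat, _ = invert_rewards(dataset["rewards"], rate=0.3, scale=1.0)
    print(obs_hat.shape, rew_hat.shape)
```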
Appendix B.2. Tasks in D4RL
Task | Allowed Steps | Samples |
---|---|---|
Halfcheetah-medium-replay-v2 | | 101,000 |
Hopper-medium-replay-v2 | | 200,920 |
Walker2d-medium-replay-v2 | | 100,930 |
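The datasets in the table above follow the standard D4RL naming; a minimal loading sketch (assuming the `gym` and `d4rl` packages are installed):

```python
import gym
import d4rl  # noqa: F401  (importing d4rl registers the D4RL environments)

# Load one of the tasks listed in the table above.
env = gym.make("hopper-medium-replay-v2")
dataset = d4rl.qlearning_dataset(env)  # observations, actions, rewards, next_observations, terminals

print({k: v.shape for k, v in dataset.items()})
print("transitions:", dataset["observations"].shape[0])
```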
Appendix B.3. Hyperparameters
Parameter | Setting |
---|---|
Buffer size | 2 × 10⁶ |
Batch size | 256 |
Intervention prob | 0.2 |
Intervention noise scale | 0.1 |
| 0.5 |
| 0.5 |
| 0.7 |
Corruption range | 1.0 |
Corruption rate | 0.3 |
References
- Prudencio, R.F.; Maximo, M.R.; Colombini, E.L. A survey on offline reinforcement learning: Taxonomy, review, and open problems. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 10237–10257. [Google Scholar] [CrossRef] [PubMed]
- Zhang, X.; Chen, Y.; Zhu, X.; Sun, W. Corruption-robust offline reinforcement learning. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, Virtual Conference, 28–30 March 2022; pp. 5757–5773. [Google Scholar]
- Ye, C.; Xiong, W.; Gu, Q.; Zhang, T. Corruption-robust algorithms with uncertainty weighting for nonlinear contextual bandits and markov decision processes. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 39834–39863. [Google Scholar]
- Ding, W.; Shi, L.; Chi, Y.; Zhao, D. Seeing is not believing: Robust reinforcement learning against spurious correlation. Adv. Neural Inf. Process. Syst. 2023, 36, 66328–66363. [Google Scholar]
- Yang, R.; Zhong, H.; Xu, J.; Zhang, A.; Zhang, C.; Han, L.; Zhang, T. Towards Robust Offline Reinforcement Learning under Diverse Data Corruption. In Proceedings of the 12th International Conference on Learning Representations, ICLR 2024, Vienna, Austria, 7–11 May 2024. [Google Scholar]
- Yang, R.; Wang, J.; Wu, G.; Li, B. Uncertainty-based Offline Variational Bayesian Reinforcement Learning for Robustness under Diverse Data Corruptions. In Proceedings of the Advances in Neural Information Processing Systems; Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2024; Volume 37, pp. 39748–39783. [Google Scholar]
- Xu, J.; Yang, R.; Qiu, S.; Luo, F.; Fang, M.; Wang, B.; Han, L. Tackling Data Corruption in Offline Reinforcement Learning via Sequence Modeling. arXiv 2025, arXiv:2407.04285. [Google Scholar] [CrossRef]
- Mondal, A.; Mishra, D.; Prasad, G.; Hossain, A. Joint Optimization Framework for Minimization of Device Energy Consumption in Transmission Rate Constrained UAV-Assisted IoT Network. IEEE Internet Things J. 2022, 9, 9591–9607. [Google Scholar] [CrossRef]
- Chen, C.; Wang, Y.; Munir, N.S.; Zhou, X.; Zhou, X. Revisiting Adversarial Perception Attacks and Defense Methods on Autonomous Driving Systems. In Proceedings of the 2025 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), Naples, Italy, 23–26 June 2025; pp. 242–249. [Google Scholar] [CrossRef]
- Cao, L. AI in Finance: Challenges, Techniques, and Opportunities. ACM Comput. Surv. 2022, 55, 64. [Google Scholar] [CrossRef]
- Galli, F. Algorithmic Manipulation. In Algorithmic Marketing and EU Law on Unfair Commercial Practices; Springer International Publishing: Cham, Switzerland, 2022; pp. 209–259. [Google Scholar] [CrossRef]
- Sun, X.; Meng, X. A Robust Control Approach to Event-Triggered Networked Control Systems With Time-Varying Delays. IEEE Access 2021, 9, 64653–64664. [Google Scholar] [CrossRef]
- Zhang, D.; Wang, Y.; Meng, L.; Yan, J.; Qin, C. Adaptive critic design for safety-optimal FTC of unknown nonlinear systems with asymmetric constrained-input. ISA Trans. 2024, 155, 309–318. [Google Scholar] [CrossRef]
- Zhang, D.; Hao, X.; Liang, L.; Liu, W.; Qin, C. A novel deep convolutional neural network algorithm for surface defect detection. J. Comput. Des. Eng. 2022, 9, 1616–1632. [Google Scholar] [CrossRef]
- Tang, M.; Cai, S.; Lau, V.K.N. Online System Identification and Optimal Control for Mission-Critical IoT Systems Over MIMO Fading Channels. IEEE Internet Things J. 2022, 9, 21157–21173. [Google Scholar] [CrossRef]
- Fujimoto, S.; Meger, D.; Precup, D. Off-policy deep reinforcement learning without exploration. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 2052–2062. [Google Scholar]
- Kumar, A.; Fu, J.; Soh, M.; Tucker, G.; Levine, S. Stabilizing off-policy q-learning via bootstrapping error reduction. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
- Wu, Y.; Tucker, G.; Nachum, O. Behavior Regularized Offline Reinforcement Learning. arXiv 2019, arXiv:1911.11361. [Google Scholar] [CrossRef]
- Ma, Y.; Jayaraman, D.; Bastani, O. Conservative offline distributional reinforcement learning. Adv. Neural Inf. Process. Syst. 2021, 34, 19235–19247. [Google Scholar]
- Xu, H.; Jiang, L.; Li, J.; Yang, Z.; Wang, Z.; Chan, V.W.K.; Zhan, X. Offline RL with No OOD Actions: In-Sample Learning via Implicit Value Regularization. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Liu, J.; Zhang, Z.; Wei, Z.; Zhuang, Z.; Kang, Y.; Gai, S.; Wang, D. Beyond ood state actions: Supported cross-domain offline reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 13945–13953. [Google Scholar]
- Wang, D.; Li, L.; Wei, W.; Yu, Q.; Hao, J.; Liang, J. Improving Generalization in Offline Reinforcement Learning via Latent Distribution Representation Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 21053–21061. [Google Scholar]
- Kumar, A.; Zhou, A.; Tucker, G.; Levine, S. Conservative Q-Learning for Offline Reinforcement Learning. In Proceedings of the Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 1179–1191. [Google Scholar]
- Kostrikov, I.; Nair, A.; Levine, S. Offline Reinforcement Learning with Implicit Q-Learning. In Proceedings of the International Conference on Learning Representations, Virtually, 25–29 April 2022. [Google Scholar]
- Chebotar, Y.; Vuong, Q.; Hausman, K.; Xia, F.; Lu, Y.; Irpan, A.; Kumar, A.; Yu, T.; Herzog, A.; Pertsch, K.; et al. Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions. In Proceedings of the 7th Annual Conference on Robot Learning, Atlanta, GA, USA, 6–9 November 2023. [Google Scholar]
- Zheng, Y.; Li, J.; Yu, D.; Yang, Y.; Li, S.E.; Zhan, X.; Liu, J. Safe Offline Reinforcement Learning with Feasibility-Guided Diffusion Model. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
- Huang, L.; Dong, B.; Zhang, W. Efficient Offline Reinforcement Learning With Relaxed Conservatism. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5260–5272. [Google Scholar] [CrossRef]
- Ye, C.; Xiong, W.; Gu, Q.; Zhang, T. Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear Contextual Bandits and Markov Decision Processes. arXiv 2022, arXiv:2212.05949. [Google Scholar] [CrossRef]
- Mandal, D.; Nika, A.; Kamalaruban, P.; Singla, A.; Radanovic, G. Corruption Robust Offline Reinforcement Learning with Human Feedback. arXiv 2024, arXiv:2402.06734. [Google Scholar] [CrossRef]
- Chen, L.; Lu, K.; Rajeswaran, A.; Lee, K.; Grover, A.; Laskin, M.; Abbeel, P.; Srinivas, A.; Mordatch, I. Decision Transformer: Reinforcement Learning via Sequence Modeling. In Proceedings of the Advances in Neural Information Processing Systems; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 15084–15097. [Google Scholar]
- Gmelin, K.; Bahl, S.; Mendonca, R.; Pathak, D. Efficient RL via Disentangled Environment and Agent Representations. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 11525–11545. [Google Scholar]
- Dunion, M.; McInroe, T.; Luck, K.; Hanna, J.; Albrecht, S. Conditional mutual information for disentangled representations in reinforcement learning. Adv. Neural Inf. Process. Syst. 2023, 36, 80111–80129. [Google Scholar]
- Yang, M.; Liu, F.; Chen, Z.; Shen, X.; Hao, J.; Wang, J. CausalVAE: Structured Causal Disentanglement in Variational Autoencoder. arXiv 2020, arXiv:2004.08697. [Google Scholar] [CrossRef]
- Robins, J.M. Discussion of Causal diagrams for empirical research by J. Pearl. Biometrika 1995, 82, 695–698. [Google Scholar] [CrossRef]
- Neuberg, L.G. Causality: Models, reasoning, and Inference, by Judea Pearl, Cambridge University Press, 2000. Econom. Theory 2003, 19, 675–685. [Google Scholar] [CrossRef]
- The Book of Why: The New Science of Cause and Effect. Science 2018, 361, 855. [CrossRef]
- Raghavan, A.; Bareinboim, E. Counterfactual Realizability. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
- Fu, J.; Kumar, A.; Nachum, O.; Tucker, G.; Levine, S. D4RL: Datasets for Deep Data-Driven Reinforcement Learning. arXiv 2020, arXiv:2004.07219. [Google Scholar] [CrossRef]
- An, G.; Moon, S.; Kim, J.H.; Song, H.O. Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble. In Proceedings of the Advances in Neural Information Processing Systems; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 7436–7447. [Google Scholar]
- Yang, Y.; Huang, B.; Tu, S.; Xu, L. Boosting Efficiency in Task-Agnostic Exploration through Causal Knowledge. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24, Jeju, Republic of Korea, 3–9 August 2024; pp. 5344–5352. [Google Scholar]
Environment | Attack Element | CQL | IQL | Causal | RIQL | RDT | RCFD |
---|---|---|---|---|---|---|---|
halfcheetah | obs | 12.8 ± 2.2 | 16.4 ± 3.8 | 11.7 ± 1.4 | 21.6 ± 3.5 | 19.9 ± 3.3 | 29.4 ± 3.7 |
 | acts | 43.8 ± 2.7 | 43.7 ± 1.2 | 39.9 ± 2.9 | 41.1 ± 2.4 | 30.1 ± 3.2 | 40.0 ± 2.4 |
 | reward | 39.1 ± 6.1 | 39.1 ± 0.5 | 42.0 ± 1.1 | 42.5 ± 1.6 | 34.6 ± 3.6 | 40.4 ± 2.5 |
walker2d | obs | 24.9 ± 16.3 | 24.9 ± 7.3 | 20.9 ± 10.7 | 33.7 ± 13.1 | 53.1 ± 4.8 | 44.7 ± 6.1 |
 | acts | 32.5 ± 27.1 | 77.9 ± 1.5 | 78.1 ± 7.4 | 81.3 ± 5.0 | 63.1 ± 4.9 | 71.4 ± 5.5 |
 | reward | 58.5 ± 23.5 | 59.6 ± 13.6 | 81.4 ± 4.1 | 56.7 ± 12.7 | 53.3 ± 17.2 | 73.1 ± 8.5 |
hopper | obs | 40.7 ± 11.4 | 44.1 ± 22.4 | 40.1 ± 9.3 | 23.3 ± 8.8 | 65.6 ± 3.2 | 84.3 ± 7.3 |
 | acts | 65.3 ± 34.3 | 64.7 ± 18.6 | 56.4 ± 9.5 | 66.3 ± 7.3 | 67.3 ± 12.3 | 71.6 ± 4.6 |
 | reward | 60.5 ± 12.8 | 62.7 ± 16.5 | 48.1 ± 2.4 | 66.0 ± 21.2 | 66.4 ± 11.3 | 72.5 ± 14.8 |
mean | | 42.0 | 47.2 | 46.5 | 48.9 | 50.4 | 58.6 |
Environment | Attack Element | CQL | IQL | Causal | RIQL | RDT | RCFD |
---|---|---|---|---|---|---|---|
halfcheetah | obs | 25.3 ± 22.3 | 23.7 ± 2.6 | 12.4 ± 3.7 | 31.3 ± 4.8 | 30.2 ± 4.0 | 19.9 ± 3.9 |
 | acts | 33.6 ± 14.4 | 37.5 ± 1.6 | 32.8 ± 3.7 | 31.4 ± 4.5 | 24.8 ± 2.9 | 30.5 ± 7.2 |
 | reward | 43.7 ± 0.4 | 42.5 ± 2.0 | 44.1 ± 0.9 | 43.5 ± 1.5 | 34.2 ± 2.7 | 43.0 ± 0.2 |
walker2d | obs | 69.8 ± 4.9 | 27.4 ± 6.1 | 23.6 ± 12.7 | 28.7 ± 18.9 | 49.7 ± 2.3 | 66.7 ± 11.7 |
 | acts | 29.5 ± 6.8 | 47.4 ± 8.1 | 39.1 ± 12.4 | 57.8 ± 5.6 | 38.6 ± 10.1 | 55.8 ± 11.3 |
 | reward | 47.3 ± 21.7 | 68.9 ± 6.0 | 85.1 ± 1.3 | 75.6 ± 8.2 | 63.9 ± 1.9 | 80.7 ± 1.1 |
hopper | obs | 74.6 ± 11.8 | 41.4 ± 6.4 | 35.5 ± 6.6 | 41.2 ± 7.4 | 60.3 ± 12.3 | 46.6 ± 13.9 |
 | acts | 48.9 ± 24.9 | 61.7 ± 9.8 | 49.8 ± 15.0 | 53.5 ± 28.5 | 36.0 ± 14.5 | 76.2 ± 19.9 |
 | reward | 25.2 ± 3.4 | 59.8 ± 9.3 | 51.5 ± 6.1 | 60.5 ± 9.4 | 67.4 ± 11.7 | 71.9 ± 9.9 |
mean | | 44.2 | 45.6 | 41.6 | 47.1 | 45.0 | 54.6 |