This is an early access version, the complete PDF, HTML, and XML versions will be available soon.
Article

RIB-Guard: A Risk-Aware Information Bottleneck Defense for Black-Box Large Language Models

1 School of Computer Science and Engineering, University of Electronic Science and Technology of China, No. 2006, Xiyuan Ave, Chengdu 611731, China
2 Meta Platforms Inc., Menlo Park, CA 94025, USA
3 Department of Information Technology, Uppsala University, 752 37 Uppsala, Sweden
* Author to whom correspondence should be addressed.
Entropy 2026, 28(6), 585; https://doi.org/10.3390/e28060585 (registering DOI)
Submission received: 21 April 2026 / Revised: 18 May 2026 / Accepted: 19 May 2026 / Published: 24 May 2026
(This article belongs to the Special Issue The Information Bottleneck Method: Theory and Applications)

Abstract

Large language models (LLMs) remain vulnerable to jailbreak attacks, especially in black-box settings where target-model gradients and internal tokenization are inaccessible. Recent information bottleneck-based defenses cast prompt protection as a compression problem, but existing methods still rely heavily on white-box optimization and the intrinsic alignment strength of the protected model. To address these limitations, we propose RIB-Guard, a safety-aware information bottleneck defense for black-box LLMs. RIB-Guard learns a token-level masking policy that extracts a minimally safety-sufficient prompt via reinforcement learning using only black-box feedback. In addition, it introduces an independent lightweight safety head to estimate residual jailbreak risk and provide model-agnostic safety guidance during training. The proposed framework jointly balances prompt compactness, benign utility preservation, and residual risk suppression within a unified objective. Experimental results on direct single-turn harmful and benign prompt settings show that RIB-Guard improves jailbreak robustness while maintaining competitive benign utility. By extending information bottleneck-based prompt protection from white-box to black-box settings, RIB-Guard provides a step toward safety-aware information-theoretic front-end defense for black-box LLMs.
Keywords: large language models; jailbreak defense; information bottleneck; reinforcement learning; prompt masking
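
To make the objective described in the abstract more concrete, the following is a minimal sketch of how a token-level masking policy could be trained from scalar black-box feedback. It is not the authors' implementation: the reward form, the weights `beta` and `lam`, the independent Bernoulli policy, the plain REINFORCE update, and the `toy_feedback` stand-in for the protected model and safety head are all assumptions made for illustration.

```python
import math
import random

FLAGGED = {"ignore"}  # hypothetical marker of risky tokens, for the toy feedback only


def sample_mask(keep_logits, rng):
    """Sample a keep (1) / drop (0) decision per token from independent Bernoullis."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in keep_logits]
    mask = [1 if rng.random() < p else 0 for p in probs]
    return mask, probs


def reward(mask, utility, risk, beta=0.5, lam=1.0):
    """Assumed scalar reward: utility minus a compactness penalty and a risk penalty."""
    kept_fraction = sum(mask) / len(mask)
    return utility - beta * kept_fraction - lam * risk


def reinforce_step(keep_logits, mask, probs, r, baseline=0.0, lr=0.1):
    """One REINFORCE update; for a Bernoulli, d log p(mask) / d logit = mask - prob."""
    advantage = r - baseline
    return [z + lr * advantage * (m - p) for z, m, p in zip(keep_logits, mask, probs)]


def toy_feedback(tokens, mask):
    """Hypothetical stand-in for black-box feedback: returns (utility, risk) in [0, 1]."""
    kept = [t for t, m in zip(tokens, mask) if m]
    benign = [t for t in tokens if t not in FLAGGED]
    utility = sum(1 for t in kept if t not in FLAGGED) / max(1, len(benign))
    risk = 1.0 if any(t in FLAGGED for t in kept) else 0.0
    return utility, risk


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = ["please", "ignore", "summarize", "report"]
    logits = [0.0] * len(tokens)
    for _ in range(500):
        mask, probs = sample_mask(logits, rng)
        utility, risk = toy_feedback(tokens, mask)
        logits = reinforce_step(logits, mask, probs, reward(mask, utility, risk))
    print([round(z, 2) for z in logits])
```

Only scalar feedback enters the update, so no gradients or tokenizer internals of the protected model are needed, which is the black-box constraint the abstract emphasizes. In this sketch `beta` trades compactness against utility and `lam` sets how strongly residual risk is penalized; both values are illustrative.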

Share and Cite

MDPI and ACS Style

Cai, M.; Shen, Y.; Luo, X.; Hu, J. RIB-Guard: A Risk-Aware Information Bottleneck Defense for Black-Box Large Language Models. Entropy 2026, 28, 585. https://doi.org/10.3390/e28060585
