This is an early access version; the complete PDF, HTML, and XML versions will be available soon.
Open Access Article
RIB-Guard: A Risk-Aware Information Bottleneck Defense for Black-Box Large Language Models
by Muen Cai 1, Yuan Shen 2, Xiong Luo 3 and Jian Hu 1,*
1 School of Computer Science and Engineering, University of Electronic Science and Technology of China, No. 2006, Xiyuan Ave, Chengdu 611731, China
2 Meta Platforms Inc., Menlo Park, CA 94025, USA
3 Department of Information Technology, Uppsala University, 752 37 Uppsala, Sweden
* Author to whom correspondence should be addressed.
Entropy 2026, 28(6), 585; https://doi.org/10.3390/e28060585 (registering DOI)
Submission received: 21 April 2026 / Revised: 18 May 2026 / Accepted: 19 May 2026 / Published: 24 May 2026
Abstract
Large language models (LLMs) remain vulnerable to jailbreak attacks, especially in black-box settings where target-model gradients and internal tokenization are inaccessible. Recent information bottleneck-based defenses cast prompt protection as a compression problem, but existing methods still rely heavily on white-box optimization and the intrinsic alignment strength of the protected model. To address these limitations, we propose RIB-Guard, a safety-aware information bottleneck defense for black-box LLMs. RIB-Guard learns a token-level masking policy that extracts a minimally safety-sufficient prompt via reinforcement learning using only black-box feedback. In addition, it introduces an independent lightweight safety head to estimate residual jailbreak risk and provide model-agnostic safety guidance during training. The proposed framework jointly balances prompt compactness, benign utility preservation, and residual risk suppression within a unified objective. Experimental results on direct single-turn harmful and benign prompt settings show that RIB-Guard improves jailbreak robustness while maintaining competitive benign utility. By extending information bottleneck-based prompt protection from white-box to black-box settings, RIB-Guard provides a step toward safety-aware information-theoretic front-end defense for black-box LLMs.
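The abstract describes a unified objective that balances prompt compactness, benign utility, and residual risk, but it does not state the objective's form. The following is a minimal sketch of how such a scalar reward for a token-level masking policy might be composed, assuming a simple weighted sum. The names `RewardWeights`, `apply_mask`, and `masked_prompt_reward`, the weight values, and the two toy scorers are all hypothetical illustrations and are not taken from the paper. In the setting the abstract describes, the utility signal would come from black-box feedback and the risk signal from the independent safety head.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class RewardWeights:
    # Hypothetical trade-off coefficients; the paper's values are not given here.
    utility: float = 1.0
    risk: float = 1.0
    compactness: float = 0.1


def apply_mask(tokens: Sequence[str], mask: Sequence[int]) -> List[str]:
    """Keep only the tokens whose mask entry is truthy."""
    if len(tokens) != len(mask):
        raise ValueError("tokens and mask must have the same length")
    return [tok for tok, keep in zip(tokens, mask) if keep]


def masked_prompt_reward(
    tokens: Sequence[str],
    mask: Sequence[int],
    utility_fn: Callable[[Sequence[str]], float],
    risk_fn: Callable[[Sequence[str]], float],
    weights: RewardWeights,
) -> float:
    """Assumed weighted-sum reward: reward utility, penalize risk and length."""
    kept = apply_mask(tokens, mask)
    keep_ratio = len(kept) / len(tokens) if tokens else 0.0
    return (
        weights.utility * utility_fn(kept)
        - weights.risk * risk_fn(kept)
        - weights.compactness * keep_ratio
    )


# Toy stand-ins for the black-box utility feedback and the safety head.
def toy_utility(kept: Sequence[str]) -> float:
    return 1.0 if "summarize" in kept else 0.0


def toy_risk(kept: Sequence[str]) -> float:
    triggers = {"ignore", "rules"}
    return sum(tok in triggers for tok in kept) / max(len(kept), 1)


tokens = ["ignore", "rules", "and", "summarize", "this", "text"]
full_reward = masked_prompt_reward(
    tokens, [1, 1, 1, 1, 1, 1], toy_utility, toy_risk, RewardWeights()
)
masked_reward = masked_prompt_reward(
    tokens, [0, 0, 0, 1, 1, 1], toy_utility, toy_risk, RewardWeights()
)
```

In this toy setup, the mask that drops the trigger tokens while keeping the task-bearing tokens scores higher than the unmasked prompt, which is the behaviour a reinforcement learning policy trained on such a reward would be pushed toward.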