Abstract
Ensuring that Large Language Models (LLMs) align with the human value of honesty is a critical challenge, particularly given the scarcity of labeled data for distinguishing what a model knows from what it does not, i.e., its knowledge boundary. We propose a weak-to-strong generalization framework based on Group Relative Policy Optimization (GRPO). Unlike standard supervised fine-tuning or prompt engineering, our framework trains a lightweight “honest head” to rank response candidates according to multifaceted honesty scores. Crucially, we employ GRPO to optimize this head, leveraging group-relative advantages and PPO-style clipping to learn robustly from noisy, relative honesty signals. The weak honest head then guides the self-labeling of unlabeled data, which is used to fine-tune strong LLMs. Experiments on PopQA, SQuAD, Non-AmbigQA, and a domain-specific military medical dataset demonstrate that our framework significantly outperforms strong baselines, including Direct Preference Optimization (DPO), in honesty alignment.
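To make the GRPO component of the abstract concrete, the following is a minimal sketch (not the authors' implementation) of the group-relative, PPO-style clipped objective it refers to. It assumes a hypothetical honest head whose log-probabilities over a group of candidate responses are given, together with scalar honesty scores for those candidates; all function and variable names here are illustrative.

```python
# Minimal sketch of a GRPO-style update for a candidate-ranking "honest head".
# Assumptions (not from the paper): per-group log-probabilities under the current
# and behavior policies, and precomputed scalar honesty rewards per candidate.
import torch


def grpo_loss(new_logprobs: torch.Tensor,
              old_logprobs: torch.Tensor,
              rewards: torch.Tensor,
              clip_eps: float = 0.2) -> torch.Tensor:
    """Group-relative, PPO-style clipped surrogate loss for one candidate group.

    new_logprobs / old_logprobs: shape (G,), log-probabilities of selecting each
        of the G candidates under the current and old versions of the honest head.
    rewards: shape (G,), multifaceted honesty scores for the candidates.
    """
    # Group-relative advantage: standardize rewards within the group, so only
    # the relative ordering of honesty scores matters (robust to noisy scales).
    adv = (rewards - rewards.mean()) / (rewards.std(unbiased=False) + 1e-8)

    # Probability ratio between the current and old policy (honest head).
    ratio = torch.exp(new_logprobs - old_logprobs)

    # PPO-style clipping limits how far a single noisy group can move the head.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv

    # Surrogate objective is maximized, so return its negative as a loss.
    return -torch.min(unclipped, clipped).mean()


# Toy usage: a group of 4 candidates with noisy honesty scores.
new_lp = torch.log_softmax(torch.randn(4, requires_grad=True), dim=0)
old_lp = new_lp.detach().clone()
scores = torch.tensor([0.9, 0.2, 0.6, 0.1])
loss = grpo_loss(new_lp, old_lp, scores)
loss.backward()
```

Standardizing rewards within each group rather than against a learned value baseline is what distinguishes GRPO from vanilla PPO in this sketch; it is well suited to the relative honesty signals described above, since only within-group comparisons drive the gradient.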