Mathematics
  • This is an early access version; the complete PDF, HTML, and XML versions will be available soon.
  • Article
  • Open Access

30 January 2026

Weak-to-Strong Honesty Alignment via Group-Relative Policy Optimization

1 School of Cyber Science and Engineering, Wuhan University, Bayi Road, Wuhan 430072, China
2 Chinese PLA Center for Disease Control and Prevention, Beijing 100038, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue AI, Machine Learning and Optimization

Abstract

Ensuring that Large Language Models (LLMs) align with the human value of honesty is a critical challenge, particularly because labeled data for distinguishing known from unknown knowledge boundaries is scarce. We propose a weak-to-strong generalization framework built on Group-Relative Policy Optimization (GRPO). Unlike standard supervised fine-tuning or prompt engineering, our framework trains a lightweight “honest head” to rank response candidates by multifaceted honesty scores. Crucially, we optimize this head with GRPO, leveraging group-relative advantages and PPO-style clipping to learn robustly from noisy, relative honesty signals. The weak honest head then guides self-labeling of unlabeled data, which is used to fine-tune strong LLMs. Experiments on PopQA, SQuAD, Non-AmbigQA, and a domain-specific military medical dataset show that our framework significantly outperforms strong baselines, including Direct Preference Optimization (DPO), in honesty alignment.
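
The optimization described above can be illustrated with a minimal sketch; this is not the authors' implementation, only the two ingredients named in the abstract: advantages computed relative to a group of sampled responses, and a PPO-style clipped surrogate. The function and variable names (grpo_loss, honesty_scores, clip_eps) are illustrative assumptions.

```python
# Minimal GRPO sketch (illustrative, not the paper's code): one prompt,
# a group of sampled responses, and their honesty scores from the honest head.
import torch

def grpo_loss(new_logps: torch.Tensor,
              old_logps: torch.Tensor,
              honesty_scores: torch.Tensor,
              clip_eps: float = 0.2) -> torch.Tensor:
    """All tensors have shape (group_size,) for a single prompt's response group."""
    # Group-relative advantage: standardize each response's honesty score
    # against the mean and std of its own sampling group.
    adv = (honesty_scores - honesty_scores.mean()) / (honesty_scores.std() + 1e-8)

    # PPO-style clipped surrogate on the policy probability ratio.
    ratio = torch.exp(new_logps - old_logps)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()

# Example: a group of 4 sampled responses for one prompt.
old = torch.randn(4)
new = old + 0.05 * torch.randn(4)
scores = torch.tensor([0.9, 0.2, 0.6, 0.1])  # multifaceted honesty scores
print(grpo_loss(new, old, scores))
```

Standardizing rewards within each group removes the need for a learned value baseline, while the clipped ratio bounds how far any single noisy honesty score can move the head in one update.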
