Steering large language models (LLMs) toward desired behaviors while preserving privacy is a critical challenge in AI alignment. Existing differentially private (DP) steering methods, such as
PSA, add high-dimensional noise that can severely degrade steering accuracy. We propose
DP-JL, a novel approach that combines Johnson–Lindenstrauss (JL) random projection with differential privacy to reduce noise while maintaining formal privacy guarantees.
DP-JL projects steering vectors into a lower-dimensional space (dimension k) before adding DP noise, reducing the total noise magnitude from O(√d) to O(√k), where k ≪ d, while the privacy budget (ε, δ) remains unchanged. We evaluate
DP-JL on seven behavioral datasets with LLaMA-2-7B, Mistral-7B, Qwen2.5-7B, and Gemma-2-9B, alongside general capability benchmarks (MMLU, TruthfulQA). All accuracy values are measured on held-out test sets. Results show that
DP-JL achieves: (1) up to 22.76 percentage points higher steering accuracy than
PSA on the myopic-reward dataset at a fixed privacy budget (ε, δ); (2) a 91.7% win rate on sycophancy, with an average accuracy improvement of 3.01 percentage points; (3) systematic advantages in high-privacy regimes (small ε); and (4) superior capability preservation on related tasks (TruthfulQA), achieving 6.6 percentage points higher accuracy than
PSA. Furthermore, visualizations and layer-sensitivity analyses reveal that
DP-JL faithfully preserves the geometric structure of activation spaces, explaining its robustness. Our findings demonstrate that
DP-JL offers superior privacy–utility trade-offs while better preserving model capabilities.
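The core idea summarized above — JL-project the steering vector to k dimensions, add Gaussian DP noise there, then map back — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `dp_jl_steering`, the clipping step, the use of P-transpose as the back-projection, and the Gaussian-mechanism noise calibration are all assumptions for the sake of a runnable example.

```python
import numpy as np

def dp_jl_steering(v, k, epsilon, delta, clip_norm=1.0, rng=None):
    """Hypothetical sketch: JL-project a steering vector to dimension k,
    add Gaussian DP noise in the low-dimensional space, map back."""
    rng = np.random.default_rng(rng)
    d = v.shape[0]
    # Clip to bound the L2 sensitivity of the vector being privatized.
    v = v * min(1.0, clip_norm / (np.linalg.norm(v) + 1e-12))
    # JL random projection: i.i.d. N(0, 1/k) entries approximately
    # preserve norms, so the clipped sensitivity carries over (w.h.p.,
    # up to JL distortion -- glossed over in this sketch).
    P = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, d))
    z = P @ v
    # Classic Gaussian-mechanism noise scale for (epsilon, delta)-DP.
    sigma = clip_norm * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    # Noise is added in k dimensions, so its expected total magnitude
    # scales as sigma * sqrt(k) rather than sigma * sqrt(d).
    z_noisy = z + rng.normal(0.0, sigma, size=k)
    # Map back to the full activation space (P^T as approximate inverse).
    return P.T @ z_noisy
```

The noise vector lives in k dimensions, so its expected norm scales as σ√k instead of σ√d, which is the O(√d) → O(√k) reduction the abstract refers to; the privacy analysis is unchanged because the mechanism privatizes the k-dimensional projection directly.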