With the rapid development of large language models (LLMs) and ever-evolving practical requirements, finding an efficient and effective alignment method has never been more critical. However, the tension between the complexity of current alignment methods and the need for rapid iteration in deployment scenarios calls for a model-agnostic alignment approach that can operate under these constraints. In this paper, we introduce Aligner, a novel and simple alignment paradigm that uses a small model to learn the correctional residuals between preferred and dispreferred answers. Designed as a model-agnostic, plug-and-play module, Aligner can be applied directly to various open-source and API-based models after a single round of training, making it suitable for rapid iteration. Notably, Aligner can be applied to any powerful, large-scale upstream model. Moreover, it can iteratively bootstrap the upstream model by using corrected responses as synthetic human preference data, breaking through the model's performance ceiling. Our experiments demonstrate performance improvements when the same Aligner model is deployed across 11 different LLMs, evaluated on the 3H dimensions (helpfulness, harmlessness, and honesty). Specifically, Aligner-7B achieves an average improvement of \(68.9\%\) in helpfulness and \(22.8\%\) in harmlessness across the tested LLMs while also effectively reducing hallucination. On the Alpaca-Eval leaderboard, stacking Aligner-2B on GPT-4 Turbo improved its LC Win Rate from \(55.0\%\) to \(58.3\%\), surpassing GPT-4 Omni's \(57.5\%\) Win Rate (community report).
#1: Multi-round RLHF training via Aligner
New Multi-round Training Pipeline
As a data augmentation tool, Aligner can enhance the upstream model's response \(A\) into an improved response \(A^*\), thereby forming a synthetic preference dataset. This dataset can be used to further train the upstream model via RLHF/DPO. Repeating this process allows for multi-round RLHF or DPO.
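The loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `upstream`, `aligner`, and `train_step` are hypothetical callables standing in for the upstream model, the trained Aligner, and one RLHF/DPO update. Dropping pairs that the Aligner left unchanged is an assumption made here for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List

Upstream = Callable[[str], str]       # prompt -> response A
Aligner = Callable[[str, str], str]   # (prompt, A) -> corrected response A*


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # A*: the Aligner-corrected response
    rejected: str  # A: the original upstream response


def build_preference_dataset(
    prompts: List[str], upstream: Upstream, aligner: Aligner
) -> List[PreferencePair]:
    """Turn upstream responses and their corrections into preference pairs."""
    pairs = []
    for prompt in prompts:
        answer = upstream(prompt)
        corrected = aligner(prompt, answer)
        # Assumption: an unchanged answer carries no preference signal, so skip it.
        if corrected.strip() != answer.strip():
            pairs.append(PreferencePair(prompt, corrected, answer))
    return pairs


def multi_round(prompts, upstream, aligner, train_step, rounds=3):
    """Repeat: generate A, correct to A*, train the upstream model on (A*, A)."""
    for _ in range(rounds):
        pairs = build_preference_dataset(prompts, upstream, aligner)
        # train_step is a placeholder for an RLHF or DPO update on the pairs.
        upstream = train_step(upstream, pairs)
    return upstream
```

Each round regenerates responses from the updated upstream model, so the synthetic preference data tracks the model's current behavior rather than its initial one.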
Performance
#2: A New Alternative to Inference-time Intervention Methods
We also conducted supplementary experiments comparing Aligner with Best-of-N (BoN) sampling and beam search, two alternative inference-time intervention methods. Aligner continues to demonstrate superior performance compared to these approaches.
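For readers unfamiliar with BoN, the sketch below contrasts how the two approaches are structured at inference time. It is illustrative only: `sample`, `score`, and `aligner` are hypothetical callables (an upstream sampler, a scorer such as a reward model, and a trained Aligner), and the experimental setup used in the comparison may differ.

```python
from typing import Callable


def best_of_n(
    prompt: str,
    sample: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 4,
) -> str:
    """Best-of-N: draw n candidates from the upstream model, keep the top-scored one."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


def aligner_refine(
    prompt: str,
    sample: Callable[[str], str],
    aligner: Callable[[str, str], str],
) -> str:
    """Aligner: draw one response, then rewrite it with the small correction model."""
    return aligner(prompt, sample(prompt))
```

In this sketch BoN calls the upstream model `n` times and can only select among what was sampled, whereas the Aligner path calls it once and edits the result.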
Performance
#3: Controllable Refinement
When refining the model's answer using Aligner, can providing feedback during the refinement process improve performance?
To explore this, we conducted a validation experiment in which feedback is incorporated into Aligner's refinement process. Rather than performing additional fine-tuning, we supplied specific prompts as feedback while Aligner refined the pre-aligned model's responses. In these experiments, Aligner was instructed to prioritize empathy, helpfulness, or harmlessness.
Performance
These experiments demonstrate that, once trained, Aligner can incorporate prompt-based feedback during refinement to achieve fine-grained adjustments without further training. This finding enhances Aligner's versatility and applicability in real-world scenarios, for example in an Instruct-Aligner setup where prompts steer the refinement.
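A minimal sketch of how such prompt-based feedback could be attached to a refinement request is shown below. The instruction wording, the `FEEDBACK` table, and `build_refinement_prompt` are all hypothetical; the prompt format a trained Aligner actually expects may differ.

```python
from typing import Optional

# Hypothetical wording, used only for illustration.
BASE_INSTRUCTION = (
    "Edit the following Question-Answer pair to make it more helpful and harmless."
)

# One feedback sentence per dimension mentioned in the experiments.
FEEDBACK = {
    "empathy": "Prioritize empathy in the revised answer.",
    "helpfulness": "Prioritize helpfulness in the revised answer.",
    "harmlessness": "Prioritize harmlessness in the revised answer.",
}


def build_refinement_prompt(
    question: str, answer: str, focus: Optional[str] = None
) -> str:
    """Compose the Aligner input, optionally adding a feedback instruction."""
    if focus is not None and focus not in FEEDBACK:
        raise ValueError(f"unknown focus: {focus!r}")
    parts = [BASE_INSTRUCTION]
    if focus is not None:
        parts.append(FEEDBACK[focus])
    parts.append(f"Question: {question}")
    parts.append(f"Answer: {answer}")
    return "\n".join(parts)
```

Because the feedback lives entirely in the input text, changing the focus requires no change to the Aligner's weights.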
#4: Weak-to-Strong Correction via Aligner
As AI systems reach human-level performance across various tasks and undertake increasingly complex activities that are hard for humans to grasp, it becomes progressively challenging to provide ongoing, reliable feedback and ensure that their behaviors align with human intentions. This raises the Superalignment problem: how can we deliver supervisory signals to advanced AI systems and ensure they remain aligned with human goals? (AI Alignment: A Comprehensive Survey; Concrete Problems in AI Safety.) Weak-to-strong generalization is a training paradigm that leverages supervisory signals provided by weak models to enhance the performance of strong models. The weak-to-strong paper (Weak-to-strong generalization: Eliciting strong capabilities with weak supervision) conducted preliminary trials in NLP classification, chess puzzles, and reward modeling tasks, observing positive gains from simply fine-tuning strong pre-trained models on pseudo-labels produced by weak models. This paradigm is analogous to "teaching", where the weak model instructs the strong one.
Illustration of our pipeline
Pipeline of Weak-to-Strong Correction
Performance
1. Aligner achieves noticeable alignment with less training data. For instance, a 7B Aligner trained on 50K examples improves GPT-4's helpfulness by 19% and safety by 26%, and improves Vicuna 33B's helpfulness and safety by 51% and 56%, respectively.
2. Aligner offers a simpler training process. It trains a Seq2Seq model for alignment, which is more straightforward and manageable than RLHF's reward model learning followed by RL fine-tuning. The engineering intricacies of RLHF, together with the inherent instability and hyperparameter sensitivity of RL algorithms, make it challenging to implement, while Aligner's simplified approach substantially reduces complexity.
3. Unlike RLHF, Aligner does not require access to model weights. While RLHF is effective for model alignment, it depends on training the model directly, which limits its applicability to closed, API-based models and to their downstream task-specific fine-tuning. In contrast, Aligner does not modify the model's original parameters. It externalizes alignment into an independent module, offering a flexible method.
4. Aligner is not limited by model type. Under RLHF, fine-tuning different models such as Llama2 (Llama 2: Open foundation and fine-tuned chat models), Alpaca (Stanford alpaca: An instruction-following llama model), or Vicuna (Vicuna: An open-source chatbot impressing gpt-4 with 90% chatgpt quality) requires re-collecting preference data and re-tuning training parameters in both the reward model training and RL phases. Aligner supports alignment of any model with a single training run. For instance, in our experiments one Aligner training run improved helpfulness and safety for 11 different models, showcasing its wide generalization capabilities.
5. Aligner offers greater flexibility in training resource requirements. Fine-tuning a 70B model using RLHF demands significant computing resources, often hundreds of GPU cards. Specifically, RLHF requires loading not just the 70B-parameter target model, but also additional reward, Actor, and Critic models of similar size. Consequently, RLHF consumes more computing resources per unit time than pre-training. In contrast, Aligner's training strategy lets users select the scale of the Aligner according to their available computing resources. For instance, to align a 70B model, users can choose an Aligner of 7B, 13B, or 70B parameters. This flexibility reduces the overall demand for computing resources and makes large-scale model alignment practical under a range of resource constraints.
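The resource argument in the last point can be made concrete with a rough weights-only estimate. The figures below are a back-of-the-envelope sketch under stated assumptions: 2 bytes per parameter (bf16), four 70B-sized model copies for RLHF following the count in the text, and no accounting for gradients, optimizer states, or activations, which add substantially to both totals.

```python
BYTES_PER_PARAM_BF16 = 2  # assumption: weights stored in bf16


def weights_gb(
    n_params: int, copies: int = 1, bytes_per_param: int = BYTES_PER_PARAM_BF16
) -> float:
    """Memory for model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * copies * bytes_per_param / 1e9


# RLHF on a 70B model: target, reward, Actor, and Critic models of similar size.
rlhf_70b_gb = weights_gb(70_000_000_000, copies=4)  # 560.0 GB

# Training a 7B Aligner: only the Aligner's own weights are trained.
aligner_7b_gb = weights_gb(7_000_000_000)  # 14.0 GB

print(f"RLHF (4 x 70B): {rlhf_70b_gb:.0f} GB of weights")
print(f"Aligner (7B):   {aligner_7b_gb:.0f} GB of weights")
```

Under these assumptions the weights held in memory differ by a factor of 40, and the same function can be used to compare a 13B or 70B Aligner.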