I started writing this post on Dec 28th, 2023, and finished it on Feb 14th, 2024. All opinions are my own and do not reflect the views of my affiliation or anyone else. This post largely summarizes and elaborates on thoughts I have expressed on Twitter since March 21st, 2023, as shown in the Appendix. I have also added more detailed thoughts drawn from some early Anthropic and OpenAI papers and blog posts. Thanks to Hongye Jin and Yijia Shao for providing useful feedback on the initial draft.
Last year saw a boom in LLM research. One important lesson from it is that we should devote most of our effort to training a general-purpose LLM base model and then leverage it as much as possible. I might be opinionated, but I believe in one general principle: we need to respect the base model's capability during alignment. This argument may be common sense to many people, but it is still controversial to others, especially when it comes to the boundary between capability and alignment. Feel free to correct me if you have more solid empirical evidence.
In this post, I will first define model capability and alignment. Then I will discuss the boundary between capability and alignment, show some evidence that LLM capabilities come from the base model, and explain why. Based on this, I will introduce principles for respecting base model capability in each alignment method. Finally, I will emphasize the importance of evaluation in demonstrating the effectiveness of these principles.
All the arguments are based on the premise that the goal of base model construction and alignment is to obtain a general-purpose model, chatbot, or A(G)I, or at least a specialist that behaves properly in real-world cases, rather than to optimize performance on any specific task, domain, or benchmark.
The alignment problem is defined as "How can we create agents that behave in accordance with the user's intentions" [25]. In the context of LLMs, the "agent" could be the language model itself, or the LLM augmented with tools, memory, and planning/reasoning abilities as a whole. The "intentions" could be explicit or implicit. Explicit intentions are requirements expressed in natural language instructions, while implicit intentions are numerous and hard to capture with a limited set of objectives or natural language statements, such as "do not be toxic" [26]. These implicit intents can also be ambiguous and diverse, or even inconsistent across different people. Many people currently group them into three main alignment directions: Helpfulness (implicit intent and explicit instruction following), Harmlessness (safety), and Honesty (truthfulness and fewer hallucinations). In this sense, alignment is a goal rather than a method. To achieve this goal, there are many methods. The most effective may be finetuning, including SFT, RLHF, and so on. But the goal can also be partially achieved by prompting, or even during the pretraining stage.
Note: specific task alignment is out of the scope of this post, because our goal is to align a generalist model.
There are some common capabilities that grow as we scale models and the corresponding data, such as world knowledge, reasoning, coding, and multilinguality. There are also other, more surprising capabilities that correlate strongly with the scaling of model and data size during pretraining. Only given a strong and large enough base model do such capabilities show up, with or without alignment. Given a weak and small base model, even the same alignment techniques cannot produce them to nearly the same degree. This indicates that much of these capabilities comes from the base model rather than from alignment post-training.
How to pretrain a base model with such strong capabilities is out of the scope of this post. But briefly, what we need to do is collect a large amount of diverse, deduplicated, and high-quality data to pretrain a language model. The main purpose is to ensure the best generalizability of the LLM and avoid collapse. Despite the debate over memorization versus generalization, as long as we cover diverse enough domains and enough data, it might not matter whether the LLM is just doing memorization and paraphrasing.
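To make the "deduplicated" requirement a bit more concrete, here is a minimal, purely illustrative sketch of near-duplicate filtering over documents. The function names and the quadratic loop are my own simplification; real pipelines typically use MinHash/LSH to scale to billions of documents.

```python
def shingles(text: str, n: int = 5) -> set:
    """Lowercased word n-grams used as a cheap document fingerprint."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    """Overlap between two shingle sets."""
    return len(a & b) / max(len(a | b), 1)

def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a document only if it is not too similar to any already-kept one.
    Quadratic for clarity; not how a production pipeline would do it."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```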
Certainly, adjusting the ratio of specific data in the pretraining mix can improve a specific capability, but it is still not very clear where each capability exactly comes from, because the pretraining data is too large to run clean controlled experiments. We should admit that some of the capabilities above do improve with post-training, although many of them improve only marginally. So it is likely that post-training alignment mainly elicits them from the base model.
Then what is the difference between alignment and capability? We should acknowledge that there is still no clear, well-recognized boundary between the two, but I will try my best to explain it. Intuitively, alignment addresses the problem where the model has some capability but does not express it (e.g. "the model knows the answer to a question but it does not give out the answer" [26]), because the capability has not been sufficiently elicited (e.g. insufficient alignment finetuning, or intents and requirements not expressed clearly enough in the instructions). In practice, it is often hard to test whether the base model has a capability if the model declines to show it. It is even harder to probe the limits of a model with superhuman capabilities on some tasks, because human evaluation in those cases is neither reliable nor scalable. Also, if we define instruction following capability as explicit intention alignment, that capability is probably developed mainly through finetuning. But the fundamental capabilities required during instruction following (e.g. world knowledge, reasoning, coding, multilinguality) still come from the base model. Thus, instruction following finetuning (SFT or RLHF) is, in a sense, a mapping from explicit human instructions onto the base model's corresponding capabilities. Without such finetuning, we probably need in-context learning to elicit the corresponding capability from the base model [7].
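As a small illustration of that last point, consider how one might elicit a translation capability from a raw base model with a few-shot completion prompt versus asking an instruction-tuned model directly. The `base_model` and `chat_model` calls below are hypothetical placeholders, not any specific API.

```python
# Eliciting a *base* (non-instruction-tuned) model's latent capability via
# in-context learning: the model simply continues the pattern.
few_shot_prompt = """English: Where is the library?
French: Où est la bibliothèque ?

English: I would like a coffee, please.
French: Je voudrais un café, s'il vous plaît.

English: The weather is nice today.
French:"""

# translation = base_model.complete(few_shot_prompt, stop=["\n\n"])  # placeholder API

# An instruction-tuned model maps the explicit instruction onto the same capability.
instruction = "Translate to French: The weather is nice today."
# translation = chat_model.generate(instruction)  # placeholder API
```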
There is already much evidence that various capabilities of LLMs mainly come from base models.
Then we might ask the essential reason why the capabilities of LLMs mainly come from base models. Capabilities should essentially come from a large amount of data representing each capability, and there is naturally no hard boundary between pretraining and post-training. But data is scalable during pretraining, while in practice it is hard to scale post-training data to the size of pretraining data while maintaining previously learned capabilities. Below, we explain from three perspectives why the large amount of data seen during pretraining might enable various capabilities:
From the AI evolution perspective, LLMs might learn such capabilities through language acquisition, similarly to humans. Language as the representation is currently the most effective way to learn a generalist AI, because there naturally exists a huge amount of text data on the web. Using natural language as the representation, LLMs could even learn world models and advanced (super-)human capabilities, assuming those have already been expressed in natural language on the web. Through the language modeling pretraining objective, any capability that can fundamentally reduce the language modeling loss can potentially emerge [27]. This is also similar to how we humans learn about the world and build knowledge and other capabilities (e.g. reasoning) through language acquisition. It also sheds light on why one should interact with and leverage the base language model as if communicating with and reshaping a human. That is to say, after pretraining, we had better speak to and align the base model in the natural language space as much as possible, so that it can understand and digest the feedback better.
From the ML perspective, base model pretraining brings better generalizability. One foundational goal of machine learning is to generalize to unseen distributions, but neural networks are probably not that good at this kind of generalization. So the most effective way is to let the model see nearly the whole distribution during pretraining, so that there are nearly no unseen distributions at test time. The essential reason why LLMs work so well as generalists is probably that they have seen nearly all real-world text during pretraining. Thus the strongest base model probably already has the capability to handle all text-related tasks. Therefore finetuning, the current most effective form of alignment, mostly elicits the base model's capability to behave in accordance with the user's intent and complete the user's task or request, rather than teaching the model many new capabilities. Note that it is still hard to say whether LLM generalizability comes from memorization or generalization; for example, there was a recent debate over approximate retrieval vs. understanding plus reasoning.
From the information theory perspective, base model pretraining might develop intelligence through lossless compression. During pretraining, as we scale base models, LLMs achieve a lower language modeling loss (better next-token probabilities) that generalizes to all text, including text never seen. The lower the loss, the better the compression ratio, which, under this view, means better intelligence. If we truly believe that intelligence arises from such lossless compression, the core capabilities should also be developed during this pretraining process. It is not possible for finetuning or other relatively heavyweight alignment methods to see text data at this scale and compress it by obtaining a better language modeling probability trajectory over all text. Instead, the language modeling loss over the pretraining corpus typically increases slightly after alignment finetuning, which worsens the compression ratio and makes the model a slightly worse compressor, losing some capability as a general intelligence.
(Figure source: https://www.youtube.com/watch?v=dO4TPJkeaaU)
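For concreteness, here is the standard identity behind this compression view, in my own notation, assuming an arithmetic coder driven by the model's next-token distribution and the model/decoder code amortized over a large corpus:

```latex
L(x_{1:N}) \;=\; \sum_{t=1}^{N} -\log_2 p_\theta(x_t \mid x_{<t})
           \;=\; \frac{N \cdot \overline{\mathrm{CE}}_{\text{nats}}}{\ln 2}\ \text{bits},
\qquad
\text{compression ratio} \;=\; \frac{L(x_{1:N}) + |\text{model code}|}{\text{raw bits of } x_{1:N}} .
```

Under this convention a lower average loss gives a smaller compressed size and hence a better (smaller) ratio, so a post-training step that raises the loss over the pretraining corpus moves the ratio in the wrong direction.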
To achieve the alignment goal, our overarching principle is to respect the base model's capability during alignment. This can be decomposed into three fine-grained principles:
Principle 1: The success of alignment methods depends heavily on the capability of a strong base model; alignment just elicits that capability in the right direction.
Principle 2: We should maintain the general capability of the base model as much as possible during alignment.
Principle 3: If we do post-training to achieve alignment, we should ensure diverse training inputs to maintain the strong general capability of the base model.
Principle 1 lays the foundation for why various alignment methods work and why we choose them.
Principle 2 provides an important direction for alignment that is sometimes neglected, and it matters for aligning a generalist LLM or a useful AI specialist in the real world. To achieve this, we need to avoid collapse of the base model, and align and then use it in a way that is consistent with the base model. In this sense, the trend from full finetuning to parameter-efficient finetuning (PEFT, e.g. LoRA, prompt tuning, adapters), and from discrete prompt optimization to prompt engineering, reflects a gradual shift in mindset: from adapting LLMs to specific tasks and domains toward fitting any task or domain into the pretrained language model without overfitting and collapse. Intuitively, changing fewer, sparser parameters makes the base model forget and overfit less and generalize better, especially for out-of-distribution (OOD) generalization, because those test-time OOD examples look more like in-distribution examples of the pretraining stage. Ultimately, prompt engineering transforms the target task into a natural language prompt, which lies in the base model's input language space and makes the model aware of the task without distorting its parameters or probability space, thus ideally maintaining all the capabilities of the base model. Based on this principle, with larger and stronger models, one might expect the alignment procedure after pretraining to become more and more lightweight.
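To make the PEFT point concrete, here is a minimal LoRA-style layer in PyTorch. It is a sketch of the low-rank-update idea under my own simplifications (no dropout, no weight merging), not the `peft` library's implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA-style adapter: the pretrained weight stays frozen and only the
    low-rank update B @ A (plus scaling) is trained, so the vast majority of the
    base model's parameters, and hopefully its capabilities, are left untouched."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # Principle 2: do not distort the base weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```

Only A and B are trained, and because B starts at zero the adapted model initially behaves exactly like the base model, which is one reason this style of tuning tends to distort the base distribution less than full finetuning.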
Principle 3 provides high-level guidance on the technical details needed to maintain the base model's capability in post-training alignment methods. The "diversity" here encompasses both high coverage and balance. Notice that as the model gets larger, it is a common observation during finetuning that both memorization and generalization abilities get stronger. That is to say, when we work with LLMs, it is easier for the model to memorize and overfit specific patterns and degrade into a trivial model when given too many similar examples that do not cover various distributions. Thus, high coverage of different distributions is important. The purpose is for finetuning to span the base model's capability space without collapse, not to cover most real-world use cases (e.g. input typos); covering most real-world use cases is infeasible when collecting finetuning data, which is exactly why we rely on the base model's capability to handle most cases. Besides better memorization, a larger model also handles similar or more general samples more easily given just a few examples (it is more sample-efficient), because it does not need to learn many new things but only to elicit a similar distribution seen during pretraining. That explains why a limited number of instances in a narrow distribution can already elicit the base model's corresponding capability, so we can avoid piling up examples in specific distributions and instead maintain balance across distributions. In a word, diverse examples during post-training are critical for maintaining the base model's capability in every method; a toy sketch of this idea follows.
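One way (among many) to operationalize "coverage and balance" is to cluster instructions in an embedding space and cap each cluster's contribution. The sketch below is my own illustration, not a recipe from any cited paper, and the sentence encoder that produces `embeddings` is assumed rather than specified.

```python
from collections import defaultdict
from sklearn.cluster import KMeans

def balance_by_cluster(instructions, embeddings, n_clusters=100, cap_per_cluster=50):
    """Cluster instructions in an embedding space, then cap how many examples any
    single cluster contributes, so no narrow distribution dominates finetuning.
    `embeddings` is assumed to be an (N, d) array from any sentence encoder."""
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(embeddings)
    per_cluster = defaultdict(list)
    for inst, label in zip(instructions, labels):
        if len(per_cluster[label]) < cap_per_cluster:
            per_cluster[label].append(inst)
    return [inst for bucket in per_cluster.values() for inst in bucket]
```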
We use these principles to explain why we choose various alignment methods, and how to avoid pitfalls with each method.
To build base model capability that is more consistent with our alignment goal (Principle 1), one could make the pretraining data more aligned with the target values during base model pretraining (e.g. filter out harmful or biased content for the purpose of safety alignment). Besides, a larger and stronger model could potentially resolve conflicting intentions in the pretraining data automatically and align to the spirit of the majority of the data, which represents common human values and alignment goals. One could also expect the essential instruction following capability to increase as the base model gets larger and stronger. This makes it possible to use plain prompting to lead to a more helpful and harmless model; details are in the next section. Meanwhile, a larger model's self-knowledge, self-consistency, and calibration also get stronger, making it more truthful and honest through outputting meta-knowledge [12], self-consistency checking [13], and calibration [12].
By prompting instead of finetuning, we can maintain the capability of the base model as much as possible (Principle 2). In fact, prompting by itself is already a relatively strong alignment method, with minimal alignment tax compared with finetuning [1]. Alignment with prompting alone only works if we have built a base model with strong capabilities (Principle 1), especially instruction following and in-context learning; in this sense, prompting just elicits them. Specifically, proper instruction prompting [1], combined with in-context learning [7], can make a large enough model more helpful. A larger model also has the capacity for moral self-correction [11], i.e. targeted prompts and instructions alone can make it more harmless.
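For illustration, a prompting-only intervention in the spirit of the moral self-correction result [11] might look like the sketch below. The wording is mine rather than the paper's exact prompts, and `large_model.generate` is a placeholder.

```python
def with_self_correction(user_request: str) -> str:
    """Prepend a targeted instruction; no finetuning involved, so the base model's
    parameters and probability space are untouched."""
    preamble = (
        "Please answer the following request helpfully and accurately, "
        "while making sure the answer does not rely on harmful stereotypes "
        "or biased assumptions about any group of people.\n\n"
    )
    return preamble + user_request

prompt = with_self_correction("Describe what makes a good software engineer.")
# response = large_model.generate(prompt)   # `large_model` is a placeholder API
```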
When it comes to alignment through finetuning, it is probably the most effective alignment method to date. Again, maintaining the base model's capability without degrading its generalizability (Principle 2) while eliciting it is crucial to avoid pitfalls in this process. People can rely on supervised finetuning, best-of-N rejection sampling, DPO, or PPO-based RLHF to align the model to human values (a sketch of best-of-N sampling follows below). Moreover, by using a strong base model's capabilities as the key ingredient (Principle 1), self-rewarding LLMs [28], Constitutional AI [17, 34, 35], scalable oversight [18, 19], weak-to-strong generalization [16], and alignment via debate [15] are claimed to be potentially able to align a superhuman base model.
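Of the methods just listed, best-of-N rejection sampling is the lightest-touch one: it never updates the policy's weights at all. A minimal sketch, where `policy.sample` and `reward_model.score` are placeholder APIs:

```python
def best_of_n(prompt: str, policy, reward_model, n: int = 16) -> str:
    """Draw N candidates from the policy (base or SFT model) and return the one the
    reward model prefers. No weights are updated, so the base model's distribution
    is left fully intact (Principle 2)."""
    candidates = [policy.sample(prompt, temperature=1.0) for _ in range(n)]
    scores = [reward_model.score(prompt, c) for c in candidates]
    return max(zip(scores, candidates), key=lambda pair: pair[0])[1]
```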
For supervised finetuning, the diversity of instructions and the quality of annotated responses are the most important factors (Principle 3). This has been mentioned in Self-Instruct, LIMA, and many other papers [29]. If we imagine that each instruction elicits one corresponding ability of the base model and enables the model to follow various instructions for similar tasks, then a very diverse set of instructions could cover most instruction following cases. Note that it is still questionable whether 1,000 examples (e.g. in the LIMA paper) are enough to produce a good instruction following model, because more instructions are typically required to cover more types of fine-grained capabilities. But if there is too much instruction tuning data (e.g. much more than the 52k examples used in Alpaca), it is hard to maintain its balance (one dimension of diversity), which can lead to loss of the base model's general capability and to overfitting (against our Principle 2). Essentially, the reason Vicuna, trained on ShareGPT data, has remained so strong on the lmsys chatbot arena leaderboard to date is that the highest-coverage instruction data is what a large group of real-world users provides (ShareGPT), and it also aligns closely with human intents. This is consistent with our goal of having a general instruction-following model rather than optimizing performance on specific tasks. Even for a specific task, leveraging too much task-specific data instead of relying on the base model's capability is still risky, because there will always be out-of-distribution inputs for the task in real-world usage, and overfitting to the task's limited distribution during finetuning can have a negative impact.
The reason RLAIF can work is essentially also that it leverages the strong capabilities of base models to provide preferences (Principle 1), which enables reliable (strong capability) and scalable (automatic model annotation) oversight. Through this, one can still inject the human values to be aligned via principles with minimal intervention to the model (e.g. Constitutional AI [17]), while avoiding degradation of the base model's general capability, because we avoid tuning the reward model on limited and biased human preferences and thus maintain the base model's capability (Principle 2). Most importantly, we can potentially leverage many superhuman capabilities of the base model (Principle 1) to get a superhuman AGI, especially if we believe that evaluation is easier than generation in many tasks [25].
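As a concrete, heavily simplified sketch of what RLAIF-style preference collection can look like: the model itself judges a pair of responses against a written principle, producing preference labels without human annotators. The principle text and the `model.generate` call are illustrative placeholders, not the Constitutional AI paper's exact setup.

```python
PRINCIPLE = "Choose the response that is more helpful, honest, and harmless."

def ai_preference(model, prompt: str, response_a: str, response_b: str) -> str:
    """Ask the model which of two responses better follows the written principle,
    yielding an AI-labeled preference pair for reward modeling or DPO."""
    judge_prompt = (
        f"{PRINCIPLE}\n\n"
        f"Human request:\n{prompt}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Answer with a single letter, A or B:"
    )
    verdict = model.generate(judge_prompt, max_tokens=1)   # placeholder API
    return response_a if verdict.strip().upper().startswith("A") else response_b
```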
How do we align a superhuman AGI through reward modeling? One natural and unified way is to use weak-to-strong generalization to elicit the base model's capability (Principle 1), although this still poses challenges for scalable reward modeling [16]. Another is to use scalable oversight methods like recursive reward modeling [25], which, however, involves more steps and human interventions, potentially hindering the base model's generalizability and scalability (slightly against our Principle 2).
DPO seems to have all the advantages of PPO while being more convenient. But let us use our principles to compare DPO and PPO, and try to predict whether DPO will actually be better in the long run.
In DPO, the reward model is implicit and the generation and reward models are unified, which seems more natural and avoids a loss of generality, and is thus more aligned with our Principle 2 of maintaining the base model's general capability. This could be one reason it could potentially be even better than PPO. But there are also some potential risks compared with PPO. Mixing the reward model into the generation policy could compromise eliciting the evaluator's better judgment capability (slightly against Principle 1), if it is true that evaluation is easier than generation for some capabilities. There is no explicit reward model that we can observe, evaluate, and control, although we can probably recover the implicit reward model, similarly to traditional inverse reinforcement learning (IRL). From the perspective of human behavior, using DPO (IRL) is similar to "assuming the human is acting to optimize their reward directly, even when this is an inefficient way of communicating their preferences" [18].
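For reference, the DPO objective makes that implicit reward explicit: the reward is the beta-scaled log-probability ratio between the policy and the frozen reference (base/SFT) model. A minimal PyTorch sketch, with batched summed log-probs as inputs and my own variable names:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Minimal sketch of the DPO objective. Inputs are summed per-response log-probs
    under the trained policy and the frozen reference model."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between the chosen and rejected implicit rewards.
    losses = -F.logsigmoid(beta * (chosen_logratios - rejected_logratios))
    # The implicit reward discussed above can be read off directly:
    implicit_reward_chosen = beta * chosen_logratios.detach()
    return losses.mean(), implicit_reward_chosen
```

Because the frozen reference model anchors the log-ratios, DPO carries a KL-style tether to the base/SFT model much like the one PPO enforces explicitly, which is consistent with Principle 2.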
Self-rewarding DPO [28] might somewhat alleviate this issue by using the generation model as an explicit reward model in each iteration to obtain AI feedback, which not only elicits the base model's judgment capability explicitly (Principle 1) but also leverages the transferability between the reward model and the generation model (Principle 2). Such transferability comes from "translating natural language utterances into the rigid format required for the data set the reward model is trained on, making collecting feedback much more efficient." [25] That is to say, the model can receive more feedback in natural language form, since "natural language is a natural form of feedback for humans" [25]. But one risk remains whether or not self-rewarding is used with DPO: updating the generation model also updates the (implicit or shared) reward model, which may make it easier to hack and thus make reward hacking more severe [18].
Why do many improper alignment methods that do not follow our principles still show good results upon evaluation? Because of imperfections in the evaluation data and metrics, whose design essentially encodes the alignment goal.
It is common sense that no single benchmark is reliable for evaluating a model, especially when we evaluate a model's general capabilities without overfitting to any task or distribution. So if anyone claims that finetuning teaches the model some capability and significantly increases a benchmark score, that is obvious but not meaningful. First, any benchmark only measures performance on a subset along one dimension, which does not reflect general human intents. Second, such finetuning always sacrifices some of the model's general capabilities. Importantly, evaluation of alignment ultimately requires real-world human feedback (e.g. the chatbot arena, which has dynamic user inputs and a large amount of human feedback), because only a wide range of diverse human feedback can reflect diverse human intents across diverse scenarios, which is consistent with our specification of the alignment goal. AlpacaFarm and MT-Bench, as fixed-set evaluations of chatbot instruction following and multi-turn capabilities, face the same issues as other benchmarks, so do not overrate performance there. As Goodhart's law says, "When a measure becomes a target, it ceases to be a good measure."
Proper evaluation is also important for distinguishing between capability and alignment by identifying the boundary of base model capability. Essentially, we need reliable and scalable evaluation methods to explore the limits of the model's capability. As the base model becomes stronger and stronger, it may gain superhuman capabilities on some tasks, which means that even if we fully elicit the base model's capability, the tasks it performs will be hard for humans to evaluate. It is obviously even harder for people to judge where the base model's capability boundary is before it is fully elicited. In that case, because there is no ground truth, we probably have to rely on a superhuman model itself to do the evaluation, optionally combined with human-AI collaborative evaluation (as demonstrated in the sandwiching experimental paradigm) [19]. That is also where we need to build trust with the model, so that we can believe it when we use it as an evaluator or supervisor. Interpretability can play a significant role in building trust in such a scenario. As stated in section two, producing scalable explanations should probably also rely on a strong and consistent base model itself as the basis, through the model's natural language expression [14].
Note that any progress in evaluation discussed above could also contribute to model improvement, because “if we had some procedure to evaluate, we could also leverage this evaluation procedure to train the system to be more aligned” [25].
The longstanding bitter lesson [33] says that both scaling and search are important for fundamental progress in A(G)I. Scaling is definitely the key to base model capability (while data quality and diversity might somewhat reduce the scaling requirements). Nowadays, however, search is better defined as leveraging the base model's capability for self-exploration. RL in this case is not there to teach, but to elicit and to encourage or discourage the base model's capabilities. As John Schulman explains with "Trust Region Utilitarianism": "there is a sensible utility function to maximize, but it's only valid locally around the current state of the world, where the intuitions that produced it are grounded." A base model with strong capability provides us with such a trust region. This could well be a path towards AGI: combining foundation models with self-exploration search. The underlying lesson still holds: whether during pretraining or alignment, let the model itself learn and adapt, and do not force it to learn the way we humans think it should (e.g. by imposing too many architectural inductive biases).