How instruction-tuning can encourage hallucinations
or How we may be instructing LLMs to hallucinate
Instruction-tuning (i-tuning) has been all the rage recently with the advent of InstructGPT and ChatGPT and the realization that this makes raw language models way more useful to humans (some have used the term alignment, but that's a much broader topic). With the open-source release of reasonably strong LLMs such as Meta's LLaMA (which is arguably close to GPT-3 quality), the race is on to instruction-tune them. In this note we point out some issues in the way they are commonly i-tuned.
How supervised fine-tuning might encourage hallucinations
Instruction-tuning is important because raw language models are only trained to predict the next token/word (usually). In i-tuning we’re training the language model to follow instructions provided by the user (i.e. the prompt). The InstructGPT paper, which serves as a recipe for OpenAI's i-tuned API models, outlined two main training phases:
1. Supervised fine-tuning on demonstrations (SFT)
2. Learning from (human or AI) feedback (LF)
The open-source community has primarily been doing SFT because it gets you pretty far and because Reinforcement Learning from Human Feedback (RLHF), the original proposal for LF in earlier i-tuning work (e.g. “Learning to summarize from human feedback”), is tricky to get right. In this note, we describe how a common way of using SFT for i-tuning is flawed.
SFT simply takes a pre-trained language model as a starting point and runs supervised learning on input/output examples, in this case an i-tuning dataset of prompts and responses. Consider example 0 from 'databricks/databricks-dolly-15k':
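A minimal sketch of pulling up that example with the Hugging Face datasets library, assuming the instruction/context/response/category fields listed on the dataset card:

```python
# Sketch: inspect example 0 of the Dolly i-tuning dataset.
# Field names follow the dataset card; treat this as illustrative, not canonical.
from datasets import load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
example = dolly[0]

print(example["instruction"])  # the prompt the model is conditioned on
print(example["context"])      # optional grounding text (empty for many records)
print(example["response"])     # the target the model is fine-tuned to reproduce
```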
That is, the input is a prompt and the output is a response. This works reasonably well for grounded generation, such as summarization or question-answering where all the necessary information is provided in context, and for classification-type tasks, like multiple-choice questions, where the model is only expected to output a class label. However, we run into trouble with more open-ended generation, especially prompts that are information-seeking without grounding context. Consider this prompt/response from the OpenAssistant Conversations paper:
The issue is that the model here is expected to retrieve the answer from its parameters, but it has a very low chance of being correct if it hasn't even seen the relevant knowledge in pre-training (in this case, studies about Novalgin). Different models have been pre-trained on different corpora and for different numbers of tokens, and so different models have different internal knowledge. Depending on the model, the process of instruction-tuning could be encouraging one of two types of behavior:
1. retrieve the knowledge from parameters and answer the question correctly
2. guess or make up the answer (because the knowledge is missing from parameters)
So in some cases we're encouraging hallucination! Ideally, a model says something to the effect of "I don't know" when it doesn't have the required knowledge to answer. However, the underlying problem of open-source i-tuning data is that it's model-independent. It's impossible to construct a single instruction-tuning dataset that is appropriate for all models.
Learning from feedback to the rescue?
The second phase of learning from feedback (LF) is useful for further aligning the model with human preferences. Is this enough to solve our hallucination issues? Given three possible responses to an information-seeking query,
1. factually accurate answer
2. "I don't know" answer
3. factually inaccurate answer
most people would prefer (1) > (2) > (3), but whether the model truly doesn't know is model-specific. There is currently no public method to determine the internal knowledge of an LLM and use it to annotate i-tuning data in a model-specific way.
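To make the model-dependence concrete, here is a hypothetical pair of preference records for the same information-seeking prompt (all strings below are invented for illustration); the "right" ranking flips depending on whether the model being tuned actually holds the relevant knowledge:

```python
# Hypothetical preference pairs: which response should be "chosen" depends on
# the internal knowledge of the specific model being tuned. All text is made up.
prompt = "What do recent studies say about drug X?"  # stand-in information-seeking query

# For a model whose pre-training covered the topic: prefer (1) over (2).
pair_if_model_knows = {
    "prompt": prompt,
    "chosen": "Recent studies report that ...",  # factually accurate answer
    "rejected": "I don't know.",                 # honest abstention
}

# For a model that never saw the relevant knowledge: prefer (2) over (3),
# otherwise we reward confident guessing, i.e. hallucination.
pair_if_model_does_not_know = {
    "prompt": prompt,
    "chosen": "I don't know.",                    # honest abstention
    "rejected": "Drug X is proven to cure ...",   # made-up answer
}
```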
If you look at OpenAI's Model index for the various davinci models, text-davinci-002 (which works pretty well) was trained entirely with supervised learning, although the training examples are filtered by humans, an approach they call FeedME (model-specific SFT). This is one way to incorporate model-specific feedback, although in this case the model only sees good examples.
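A rough sketch of what filtering-based, model-specific SFT could look like; the exact FeedME procedure is not public, and model.generate and rate_fn below are hypothetical stand-ins for sampling from the model and human quality ratings:

```python
# Sketch of model-specific SFT data collection via filtering (FeedME-style):
# sample responses from the model being tuned, keep only those rated highly
# by humans, then run ordinary SFT on the surviving examples.
def collect_filtered_sft_data(model, prompts, rate_fn, n_samples=4, min_score=7):
    """model.generate and rate_fn are hypothetical: a sampler for the model
    being tuned and a human quality rating (e.g. 1-7) for a prompt/response."""
    sft_examples = []
    for prompt in prompts:
        for _ in range(n_samples):
            response = model.generate(prompt)
            if rate_fn(prompt, response) >= min_score:
                sft_examples.append({"prompt": prompt, "response": response})
    return sft_examples  # the model only ever sees its own "good" outputs
```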
The power of RLHF (implemented by OpenAI with PPO) is that it gives the model feedback about human preferences, which involves signal from both good and bad examples.
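Concretely, the reward model at the heart of RLHF is typically trained on such good/bad pairs with a pairwise ranking loss, roughly as sketched below in PyTorch (reward_model is an assumed scorer mapping a prompt/response pair to a scalar):

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt, chosen, rejected):
    """Pairwise ranking loss for reward modeling: push the score of the
    human-preferred response above the score of the dispreferred one."""
    r_chosen = reward_model(prompt, chosen)      # scalar score for the good example
    r_rejected = reward_model(prompt, rejected)  # scalar score for the bad example
    # InstructGPT-style objective: -log sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```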
Recently there has been a new class of LF techniques that also incorporate such feedback, which could be categorized as contrastive fine-tuning: our SLiC-HF, as well as RRHF and DPO. These are notable because they make LF about as easy as fine-tuning, while giving results competitive with RLHF-PPO.
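As one example, here is a minimal PyTorch sketch of the DPO objective: it contrasts the policy's log-probabilities on preferred vs. dispreferred responses against a frozen reference model, with no separate reward model or RL loop (each input is assumed to be the log-probability of a response summed over its tokens):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization on a batch of preference pairs.
    Inputs are per-example summed log-probs under the policy and the
    frozen reference model."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between preferred and dispreferred responses.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```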
While an in-depth comparison of these methods is outside the scope of this post (stay tuned), one strategy is to rely entirely on feedback collected for another model. This avoids potentially expensive feedback collection and is likely useful to some degree, but to truly get around the issue described above, one still needs to collect and incorporate feedback specific to the model being tuned. In particular, using the exact same i-tuning dataset for different-sized models is almost certainly going to produce instructed hallucinations. Don't blame the model, blame the teacher for hallucinations.
Some of the points here expand on insightful observations made by John Schulman (a lead on ChatGPT) in his recent Berkeley lecture, which I recommend watching.