Now we will have some grounding for when weird ChatGPT behaviors are intended or side-effects — shrinking the Overton window of RLHF bugs.

OpenAI recently shared what they call their “Model Spec,” a document that details their goal model behaviors prior to clicking go on a fine-tuning run. The name is a little vague: it’s about the model behavior and how OpenAI steers their models from behind the API. I doubt the actual model capabilities change much in this process, though capabilities are something the specification could also cover.

In the context of other great talks from post-training lead John Schulman on the challenges of RLHF with ChatGPT in the last year, it’s a funny sequence of events to get this document afterward. The Model Spec seems like something that should be set out for a model before the model is productized. It’s one of those many reflections into the way we are building AI that feels out of order, rushed, confused, and always so exciting. What really happened is that the training team had a document like this, likely used in collecting data used in RLHF, and the goals were adjusted when the product team asked for a modern one.

Regardless, Sam Altman shared a brief explanation of what this means on Twitter.

we will listen, debate, and adapt this over time, but i think it will be very useful to be clear when something is a bug vs. a decision.

To cement this, I particularly resonate with the point made by Joanne Jang, who is on the product team at OpenAI:

principles are easier to debate and get feedback on, vs. hyper-specific screenshots or abstract feel-good statements.

When taking a long-term view on AI, wanting to make sure its development aligns with our stated (and maybe subconscious) goals, having all this information on the table is a big step forward. It’s no longer just in private conversations with congressional staffers and off-the-record briefings.

Among the things we could realistically ask leading model providers to disclose about the RLHF process, this was very high up the list. The core building blocks such as preference data are often tied up with complex enterprise contracts from data providers. The intermediate building blocks like reward models, which I’ve wanted to see for quite some time, come with exposure risk and complexity that outweigh the upside to the few researchers who understand how to study them.

The model spec is a transparency tool that reduces liability from users, regulatory oversight, and legal threats because all of these groups care about what OpenAI wants to achieve, which sometimes can seem nebulous and a little crazy, rather than what actually showed up in the user’s chat box. Many legal landscapes actually depend on the intent of the actions more than the actions themselves. The model spec, which I’m happy to take OpenAI at face value and believe is accurate, is a gold mine for people like me. When you zoom into the details and track the breadcrumbs from InstructGPT’s preference data instructions document through John Schulman’s multiple RLHF talks last year and to this model spec, we have a lot to learn (or at least confirm).

Transparency is often a prerequisite for accountability. If there is something wrong with how these models are approached, we then know what to attempt to change via public channels.

Reviewing the Model Spec

Overall, the examples and principles detailed in this document are extremely reasonable and point to powerful underlying AI development and astute business sense. Many of the principles make much more sense in an applied setting than in a research setting, where I operate. The principles act most clearly in an AI system, and creating a document like this for something like Llama 3 wouldn’t make much sense; it would need to be about the deployed system as a whole.

OpenAI’s stated objectives in the Model Spec are quite simple (emphasis OpenAI’s):

They have a set of default behaviors that seem like model training properties: