Google rejoins the open model party and gets some backlash over a recurring problem for generative AI.
On Wednesday, February 21st, Google announced their open-weight models from the Gemini line of work:
Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. Developed by Google DeepMind and other teams across Google, Gemma is inspired by Gemini, and the name reflects the Latin gemma, meaning “precious stone.” Accompanying our model weights, we’re also releasing tools to support developer innovation, foster collaboration, and guide responsible use of Gemma models.
Gemma is pretty widely available for an initial launch. The weights of the pretrained and fine-tuned models are on HuggingFace, it works with JAX, PyTorch, and TensorFlow via popular model training repositories, they gave us a “ready-to-use” Colab, and it comes with commercial-friendly terms of use (which have another issue we’ll get to).
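For a sense of how low the barrier is, here is a minimal sketch of pulling the weights from HuggingFace with the transformers library. I'm assuming the launch model ids (`google/gemma-7b` for the base model, `google/gemma-7b-it` for the fine-tuned variant) and that you've accepted the terms of use on the gated repo.

```python
# Minimal sketch: load the instruction-tuned Gemma 7B from HuggingFace and generate.
# Assumes access to the gated repo has been granted after accepting Google's terms.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-7b-it"  # or "google/gemma-7b" for the base model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # fits on a single modern GPU in bf16
    device_map="auto",
)

inputs = tokenizer("Explain what an open-weight model is.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```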
Regardless, this model is great! It sets a new standard at 7 billion parameters, pretty steadily surpassing Mistral 7B. In summary:
At the same time, Google has been dealing with backlash to the image generations from its flagship model series, Gemini. The headline from The Verge tells most of the story: *Google apologizes for ‘missing the mark’ after Gemini generated racially diverse Nazis.* In short, much like DALL-E 2 in past years, Google got backlash for too strongly adding bias corrections to its model. From talking with many researchers on the ground, the sense is that multimodal RLHF and safety fine-tuning are much harder than in a single modality, so I expect this to be more of a blip than a major story. I’ll still cover what it means later in the post, but for now, we start with Gemma.
Gemma takes the next step in the LLaMA 7B to Llama 2 to Mistral progression of 7-billion-ish parameter models. The core model evaluations are really strong across a wide range of tasks (including coding), and I suspect a ton of fine-tuned models in the coming months will be based on it.
Most of this post will focus on the 7B model (and its aligned variant), which is also clearly more of an investment from them based on compute spend. The 7B model is trained for 6 trillion tokens, getting close to the rumored 8 trillion of Mistral 7B, while the smaller 2B-parameter base model is only trained for 2 trillion tokens. The 2B model also has more of a mixed bag of results when compared to the known-to-be evaluation-fishing Phi 2 model. The core point of my most recent post on open base models, OLMos, is the need to compare models weighted by both parameter count and token count. Llama 2 is only trained on 2T tokens, OLMo 7B gets close to it by training on 2.5T, and it makes sense that Mistral and Gemma are way better with 6-8T tokens.
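To make that comparison concrete, here's a quick sketch of the tokens-per-parameter ratios implied by the counts above (Mistral's 8T figure is a rumor, not a confirmed number):

```python
# Rough tokens-per-parameter comparison for 7B-class base models,
# using the training token counts discussed in the text.
models = {
    "Llama 2 7B": {"params_b": 7, "tokens_t": 2.0},
    "OLMo 7B": {"params_b": 7, "tokens_t": 2.5},
    "Gemma 7B": {"params_b": 7, "tokens_t": 6.0},
    "Mistral 7B": {"params_b": 7, "tokens_t": 8.0},  # rumored, not confirmed
}

for name, cfg in models.items():
    ratio = (cfg["tokens_t"] * 1e12) / (cfg["params_b"] * 1e9)
    print(f"{name}: ~{ratio:.0f} tokens per parameter")
```

All of these are far past the ~20 tokens per parameter of compute-optimal scaling, which is the point: at a fixed parameter count, the models trained on far more tokens should simply be better.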
While the pace of progress in base LLMs on the data and architecture side is high (architecture, beyond the benefits of mixture of experts, is mostly about optimizing hardware and training stability), it’s not worth spending more time training a base model. In a few years, we’ll see models at every major size checkpoint easily trained on twice as many tokens as we’re seeing now. By the end of the year, 1B to 7B models trained on 10T tokens will likely happen, and 20 or 30T isn’t ridiculous over longer timeframes. It makes comparing models harder, but it’s the next evolution of scaling laws.
One technicality that I need to add before showing the evaluations: it’s more of an 8B language model, with 0.7B embedding parameters and 7.75B non-embedding parameters (a total well over 8B), but they wanted to ride the comparison wave where everyone trains 7B-parameter models. 7B is the people’s model size. It is accessible to many for fine-tuning and accessible to almost everyone for inference.
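If you want to check the embedding versus non-embedding split yourself, here's a short sketch. It assumes the HuggingFace implementation, where the token embedding parameters live in a module named `embed_tokens`; the exact totals depend on that naming and on weight tying.

```python
# Sketch: split Gemma 7B's parameter count into embedding and non-embedding pieces.
# Assumes the HuggingFace implementation, where the token embedding lives under "embed_tokens".
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")

embedding = sum(p.numel() for n, p in model.named_parameters() if "embed_tokens" in n)
non_embedding = sum(p.numel() for n, p in model.named_parameters() if "embed_tokens" not in n)

print(f"embedding:     {embedding / 1e9:.2f}B")
print(f"non-embedding: {non_embedding / 1e9:.2f}B")
print(f"total:         {(embedding + non_embedding) / 1e9:.2f}B")
```

The embedding count is large because of Gemma's unusually big vocabulary, which is part of why the "7B" label undersells the total parameter count.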