Google rejoins the open model party and gets some backlash over a recurring problem for generative AI.
On Wednesday, February 21st, Google announced their open-weight models from the Gemini line of work:
Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. Developed by Google DeepMind and other teams across Google, Gemma is inspired by Gemini, and the name reflects the Latin gemma, meaning “precious stone.” Accompanying our model weights, we’re also releasing tools to support developer innovation, foster collaboration, and guide responsible use of Gemma models.
Gemma is pretty widely available for an initial launch. The weights of the pretrained and fine-tuned models are on HuggingFace, it works with JAX, PyTorch, and TensorFlow via popular model training repositories, they gave us a “ready-to-use” Colab, and it comes with commercial-friendly terms of use (which have another issue we’ll get to).
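For a sense of how low the barrier is, here is a minimal sketch of pulling the weights from HuggingFace with the transformers library. I'm assuming the launch model ids (`google/gemma-7b` for the base model, `google/gemma-7b-it` for the fine-tuned variant) and that you've accepted the terms of use on the gated repo.

```python
# Minimal sketch: load the instruction-tuned Gemma 7B from HuggingFace and generate.
# Assumes access to the gated repo has been granted after accepting Google's terms.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-7b-it"  # or "google/gemma-7b" for the base model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # fits on a single modern GPU in bf16
    device_map="auto",
)

inputs = tokenizer("Explain what an open-weight model is.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```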
Regardless, this model is great! It sets a new standard at 7 billion parameters, pretty steadily surpassing Mistral 7B. In summary:
At the same time, Google has been dealing with backlash to the image generations from its flagship model series, Gemini. The headline from The Verge tells most of the story: *Google apologizes for ‘missing the mark’ after Gemini generated racially diverse Nazis.* In short, much like DALL-E 2 in past years, Google got backlash for too strongly adding bias corrections to its model. From talking with many researchers on the ground, the sense is that multimodal RLHF and safety fine-tuning are much harder than in a single modality, so I expect this to be more of a blip than a major story. I’ll still cover what it means later in the post, but for now, we start with Gemma.
Gemma takes the next step in the LLaMA 7B to Llama 2 to Mistral progression of 7-billion-ish parameter models. The core model evaluations are really strong across a wide range of tasks (including coding), and I suspect a ton of fine-tuned models in the coming months will be based on it.
Most of this post will focus on the 7B model (and its aligned variant), which is also clearly more of an investment from them based on compute spend. The 7B model is trained for 6 trillion tokens, getting close to the rumored 8 trillion of Mistral 7B, while the smaller 2B-parameter base model is only trained for 2 trillion tokens. The 2B model also has more of a mixed bag of results when compared to the known-to-be evaluation-fishing Phi 2 model. The core point of my most recent post on open base models, OLMos, is the need to compare models weighted by both parameter count and token count. Llama 2 is only trained on 2T tokens, OLMo 7B gets close to it by training on 2.5T, and it makes sense that Mistral and Gemma are way better with 6-8T tokens.
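To make that comparison concrete, here's a quick sketch of the tokens-per-parameter ratios implied by the counts above (Mistral's 8T figure is a rumor, not a confirmed number):

```python
# Rough tokens-per-parameter comparison for 7B-class base models,
# using the training token counts discussed in the text.
models = {
    "Llama 2 7B": {"params_b": 7, "tokens_t": 2.0},
    "OLMo 7B": {"params_b": 7, "tokens_t": 2.5},
    "Gemma 7B": {"params_b": 7, "tokens_t": 6.0},
    "Mistral 7B": {"params_b": 7, "tokens_t": 8.0},  # rumored, not confirmed
}

for name, cfg in models.items():
    ratio = (cfg["tokens_t"] * 1e12) / (cfg["params_b"] * 1e9)
    print(f"{name}: ~{ratio:.0f} tokens per parameter")
```

All of these are far past the ~20 tokens per parameter of compute-optimal scaling, which is the point: at a fixed parameter count, the models trained on far more tokens should simply be better.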
While the pace of progress in base LLMs on the data and architecture side is high (architecture, beyond the benefits of mixture of experts, is mostly about optimizing hardware and training stability), it’s not worth spending more time training a base model. In a few years, we’ll see models at every major size checkpoint easily trained on twice as many tokens as we’re seeing now. By the end of the year, 1B to 7B models trained on 10T tokens will likely happen, and 20 or 30T isn’t ridiculous over longer timeframes. It makes comparing models harder, but it’s the next evolution of scaling laws.
One technicality that I need to add before showing the evaluations: it’s more of an 8B language model, with 0.7B embedding parameters and 7.75B non-embedding parameters (a total well over 8B), but they wanted to ride the comparison wave where everyone trains 7B-parameter models. 7B is the people’s model size. It is accessible to many for fine-tuning and accessible to almost everyone for inference.
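If you want to check the embedding versus non-embedding split yourself, here's a short sketch. It assumes the HuggingFace implementation, where the token embedding parameters live in a module named `embed_tokens`; the exact totals depend on that naming and on weight tying.

```python
# Sketch: split Gemma 7B's parameter count into embedding and non-embedding pieces.
# Assumes the HuggingFace implementation, where the token embedding lives under "embed_tokens".
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")

embedding = sum(p.numel() for n, p in model.named_parameters() if "embed_tokens" in n)
non_embedding = sum(p.numel() for n, p in model.named_parameters() if "embed_tokens" not in n)

print(f"embedding:     {embedding / 1e9:.2f}B")
print(f"non-embedding: {non_embedding / 1e9:.2f}B")
print(f"total:         {(embedding + non_embedding) / 1e9:.2f}B")
```

The embedding count is large because of Gemma's unusually big vocabulary, which is part of why the "7B" label undersells the total parameter count.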