Research shows: Google's Gemini Pro falls short of OpenAI's older model

2023-12-20

Less than a month ago, Google showcased its long-rumored ChatGPT competitor Gemini in a stunning demo video. However, new research has found that the most powerful version of Gemini currently available to consumers, Gemini Pro, lags behind OpenAI's GPT-3.5 Turbo large language model (LLM) in most tasks.

This is according to a research team from Carnegie Mellon University and the company BerriAI. Their paper, "An In-depth Look at Gemini's Language Abilities," was published on arXiv.org. As the paper puts it, "We find that, as of the time of writing (December 19, 2023), Gemini's Pro model achieves accuracy that is comparable to, but slightly below, the current version of OpenAI's GPT-3.5 Turbo."

For the Google researchers and leadership who have spent so much time working on Gemini, this conclusion is undoubtedly a blow. In response to the report, a Google spokesperson insisted that Gemini Pro outperforms GPT-3.5, and that the more powerful Gemini Ultra, set to be released in early 2024, scores higher than GPT-4 in Google's internal research. Here is their full response:

"In our technical paper, we compared Gemini Pro and Ultra to a range of external LLMs and our previous best model, PaLM 2, covering text-based academic benchmarks such as reasoning, reading comprehension, STEM, and coding.

These results [Table 2 on page 7 of that report] show that Gemini Pro outperforms inference-optimized models such as GPT-3.5 and performs comparably with several of the most capable models available, while Gemini Ultra outperforms all current models.

Notably, Gemini Ultra outperforms all existing models on MMLU with an accuracy of 90.04%. It is also the first model to surpass this threshold; the previous best result was 86.4%.

"Additionally, it is worth reading the Gemini authors' discussion of the subtleties of these evaluations in the paper (also on the same page), for which I have excerpted:

'Evaluation on these benchmarks is challenging and may be affected by data contamination. We conducted an extensive leaked-data analysis after training to ensure the results we report here are as scientifically sound as possible, but still found some minor issues and decided not to report results on, for example, LAMBADA (Paperno et al., 2016).

As part of the evaluation process, on a popular benchmark, HellaSwag (Zellers et al., 2019), we found that an additional hundred fine-tuning steps on specific website extracts corresponding to the HellaSwag training set (which were not included in the Gemini pretraining set) improved the validation accuracy of Gemini Pro to 89.6% and of Gemini Ultra to 96.0%, measured with one-shot prompting (we measured GPT-4 obtaining 92.3% when evaluated one-shot via the API).

This suggests that benchmark results are susceptible to the composition of the pretraining dataset. We chose to report only decontaminated HellaSwag results, in a 10-shot evaluation setting. We believe more robust and nuanced standardized evaluation benchmarks with no leaked data are needed. Therefore, we evaluated the Gemini models on several newly released held-out evaluation datasets, such as WMT23 and Math-AMC 2022-2023 problems, or on datasets generated internally from non-web sources, such as Natural2Code.

We refer readers to the appendix for a comprehensive list of our evaluation benchmarks.' Nevertheless, the models' performance on these benchmarks gives us an indication of their capabilities and of where they may have an impact on real-world tasks.

For example, Gemini Ultra's impressive reasoning and STEM capabilities pave the way for advances in LLMs within the educational domain. Its ability to handle complex mathematical and scientific concepts opens up exciting possibilities for personalized learning and intelligent tutoring systems."

What the Researchers Tested

In their new paper, researchers from CMU and BerriAI tested four different LLMs: Google's Gemini Pro, OpenAI's GPT-3.5 Turbo, GPT-4 Turbo, and the new open-source model Mixtral 8x7B from the startup Mistral.

The researchers queried all of the models through LiteLLM, a tool that aggregates many LLM APIs behind a single unified interface, over four days between December 11 and 15, 2023. They ran the models through a series of prompts, including multiple-choice questions spanning 57 subjects across STEM, the humanities, and the social sciences, as part of a "knowledge-based QA" test.
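LiteLLM exposes an OpenAI-style completion() call that routes requests to different providers based on the model name, which makes this kind of side-by-side querying straightforward to script. Below is a minimal sketch of what such a harness might look like; the model identifier strings, API keys, and sample question are illustrative placeholders rather than the researchers' actual setup.

```python
# Hypothetical sketch of querying several LLMs through LiteLLM's unified API.
# Model identifier strings and environment variables are illustrative and may
# differ depending on provider and LiteLLM version.
import os
from litellm import completion

os.environ["OPENAI_API_KEY"] = "sk-..."   # placeholder
os.environ["GEMINI_API_KEY"] = "..."      # placeholder

MODELS = [
    "gpt-3.5-turbo",        # OpenAI GPT-3.5 Turbo
    "gpt-4-1106-preview",   # GPT-4 Turbo (identifier current as of late 2023)
    "gemini/gemini-pro",    # Google Gemini Pro via LiteLLM's Gemini provider
]

QUESTION = (
    "Which planet in the Solar System has the most known moons?\n"
    "A. Earth\nB. Mars\nC. Saturn\nD. Venus\n"
    "Answer with a single letter."
)

for model in MODELS:
    response = completion(
        model=model,
        messages=[{"role": "user", "content": QUESTION}],
        temperature=0.0,  # keep answers as deterministic as possible for benchmarking
    )
    # LiteLLM mirrors the OpenAI response format.
    answer = response.choices[0].message.content
    print(f"{model}: {answer.strip()}")
```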

In that test, "Gemini Pro performed worse than GPT-3.5 Turbo and significantly worse than GPT-4 Turbo," scoring 64.12 and 60.63 out of 100 across the paper's two prompting settings, compared to GPT-3.5 Turbo's 67.75 and 70.07 and GPT-4 Turbo's 80.48 and 78.95.

Interestingly, the researchers found that Gemini disproportionately chose option "D" when presented with multiple-choice answers labeled A, B, C, or D, regardless of whether it was the correct answer.

"Gemini has a highly skewed label distribution, favoring the last option 'D,' which contrasts with the results of GPT models that are more balanced," the paper states. "This may indicate that Gemini lacks substantial instruction tuning for multiple-choice questions, which could result in biases in answer ordering."

Furthermore, the researchers observed that Gemini performed worse than GPT-3.5 Turbo in several specific categories of questions, notably human sexuality, formal logic, elementary mathematics, and professional medicine. They attributed this largely to Gemini refusing to answer some questions, responding that it could not comply because of its safety and content restrictions; the researchers counted those refusals as incorrect answers in their scoring.

Gemini Pro outperformed GPT-3.5 Turbo in two categories of multiple-choice questions - security studies and high school microeconomics - but "for the two tasks where Gemini Pro performs better than GPT-3.5 Turbo, the gains are small," the researchers pointed out. Meanwhile, GPT-4 Turbo remained the strongest of all the models tested.

To be fair, the researchers did note that Gemini outperformed GPT-3.5 Turbo when outputs ran longer than 900 tokens (tokens are the numeric identifiers assigned to words, letter combinations, and symbols that reflect a model's internal organization of different concepts).
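Token counts like that 900-token threshold can be reproduced with a tokenizer such as OpenAI's tiktoken; note that each model family uses its own tokenizer (Gemini's differs from OpenAI's), so counts like these are approximations for comparison purposes. A rough sketch, with placeholder output strings:

```python
# Rough sketch of bucketing model outputs by token length using tiktoken.
# The outputs list is a hypothetical placeholder.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

outputs = [
    "Short answer.",
    "A much longer chain-of-thought style answer with many steps... " * 200,
]

for text in outputs:
    n_tokens = len(enc.encode(text))
    bucket = "over 900 tokens" if n_tokens > 900 else "900 tokens or fewer"
    print(f"{n_tokens:>6} tokens -> {bucket}")
```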

The researchers also tested the models on a category of "general-purpose reasoning" questions, where no answer options were provided; instead, the LLMs were asked to read a logic problem and respond with what they believed to be the correct answer.

Once again, the researchers found that "Gemini Pro's accuracy is slightly lower than GPT-3.5 Turbo and significantly lower than GPT-4 Turbo... Gemini Pro performs poorly on longer, more complex questions, while the GPT models are more resilient to this. GPT-4 Turbo in particular shows almost no degradation even on longer questions, indicating an impressive ability to understand longer and more complex queries."

However, Gemini did excel over all of the other models in one area where Google has long concentrated its expertise: translation. Across 20 languages, "Gemini Pro outperforms GPT-3.5 Turbo and GPT-4 Turbo in 8 languages and achieves the best performance in 4 languages."

But even that result is tempered by the fact that "Gemini Pro shows a strong tendency to block responses in approximately 10 language pairs," suggesting an overly aggressive content moderation/safety system.

What Does This Mean for Google's AI Ambitions and Users?

The results are clearly a blow to Google's ambition to compete head-on with OpenAI in generative AI. With the more powerful Gemini Ultra not available until next year, Google may lag behind OpenAI in AI performance until then.

Interestingly, the research also covered Mistral's popular new LLM Mixtral 8x7B - which uses a "mixture of experts" approach, in which several smaller specialized models are combined and different parts of each query are routed to the experts best suited to handle them - and found that its performance also lagged significantly behind OpenAI's GPT-3.5 Turbo. Gemini Pro "outperforms Mixtral in every task we examined," according to the researchers.
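The mixture-of-experts idea can be illustrated with a toy routing layer: a small gating network scores each expert, the top-k experts process the input, and their outputs are combined using the gate weights. The sketch below is a didactic NumPy toy under those assumptions, not Mixtral's actual architecture or code.

```python
# Toy top-2 mixture-of-experts layer in NumPy (didactic sketch, not Mixtral's implementation).
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

# Each "expert" here is just a small linear layer; real MoE experts are full MLP blocks.
expert_weights = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router_weights = rng.standard_normal((d_model, n_experts))

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route a single token vector through the top-k experts and mix their outputs."""
    logits = x @ router_weights                       # one routing score per expert
    top = np.argsort(logits)[-top_k:]                 # indices of the top-k experts
    gates = np.exp(logits[top] - logits[top].max())   # softmax over the selected experts
    gates /= gates.sum()
    # Weighted sum of the selected experts' outputs.
    return sum(g * (x @ expert_weights[i]) for g, i in zip(gates, top))

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)  # (8,) -- same dimensionality as the input
```

The appeal of this design is that only the selected experts run for any given token, so total parameter count can grow without a proportional increase in per-token compute.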

This highlights a bright spot in Google's AI work: it still outperforms cutting-edge open-source solutions.

However, overall, it is hard not to come away from this research with the impression that, for now, OpenAI remains the reigning champion in the field of generative AI for consumers and businesses.

AI influencers like Ethan Mollick, a professor at the Wharton School of the University of Pennsylvania, also seem to largely agree. As Mollick posted on X, "For most individual cases, you want to use the best AI, and that still seems to be GPT-4... at least until Gemini Ultra is released in the new year."