
Google says its Gemini AI outperforms both GPT-4 and expert humans

The Gemini artificial intelligence comes in a variety of sizes, with Google saying its mid-range version will be incorporated into its Bard chatbot and available to the public from today

By Matthew Sparkes

6 December 2023

Image caption: Gemini can handle text, audio and video (credit: Google)

Google has launched a new AI model, dubbed Gemini, which it claims can outperform both OpenAI’s GPT-4 model and “expert level” humans in a range of intelligence tests.

The firm’s CEO, Sundar Pichai, revealed the existence of Gemini at Google’s I/O conference in May this year, although it was still in training at the time. But today the company has announced that it will be launching the cutting-edge model to the public.

Three versions of Gemini have been created for different applications, called Nano, Pro and Ultra, which increase in size and capability. Google declined to answer questions on the size of Pro and Ultra, the number of parameters they include or the scale or source of their training data. But its smallest version, Nano, which is designed to run locally on smartphones, is actually two models: one for slower phones that has 1.8 billion parameters and one for more powerful devices that has 3.25 billion parameters. Comparing the capabilities of AI models is an inexact science, but GPT-4 is rumoured to include up to 1.7 trillion parameters and Meta’s largest Llama 2 model has 70 billion.

The mid-range Pro version of Gemini beats some other models, such as OpenAI’s GPT-3.5, but the more powerful Ultra exceeds the capability of all existing AI models, Google claims. Ultra scored 90 per cent on the industry-standard MMLU benchmark, where an “expert level” human is expected to achieve 89.8 per cent.

This is the first time an AI has beaten humans at the test, and is the highest score for any existing model. The test involves a broad range of tricky questions on topics including logical fallacies, moral problems in everyday scenarios, medical issues, economics and geography.

In the same test, GPT-4 scored 87 per cent, Llama 2 scored 68 per cent and Anthropic’s Claude 2 scored 78.5 per cent. Gemini beat all those models in eight out of nine other common benchmark tests.

The Pro model will be integrated into Google’s Bard, an online chatbot that was launched in March this year. The company says that another version of Bard called Bard Advanced will launch early next year and feature the larger Gemini Ultra model.

The new version of Bard will be available in English in more than 170 countries as of today, but it won’t be available in other languages or even in English across the UK and Europe. Sissie Hsiao at Google says the delay is down to regulation rather than engineering: “We’re working with local policies and regulators to make sure that we’re abiding by local laws and other such things before we launch in other areas.”

Eli Collins at Google DeepMind says Gemini is the company’s largest and most capable model, but also its most general – meaning it is adaptable to a variety of tasks. Unlike many current models that focus on text, Gemini has been trained on text, images and sound and is claimed to be able to accept inputs and provide outputs in all those formats. But the Bard launch will only allow people to use text prompts as of today, with the company promising to allow audio and image interaction “in coming months”.

Collins says that Gemini is “state of the art in nearly every domain” and that it is still in testing to determine exactly how capable it is at working in different mediums, languages and applications. “We’re still working to understand all of Ultra’s novel capabilities,” he says.

No versions of Gemini were made available for testing at the launch event, but Google showed demonstrations of the AI solving homework problems and working with live video input. It is also claimed to be better at developing software than previous models: last year, DeepMind released an AI-powered code generator called AlphaCode that the firm said could beat 50 per cent of human developers, and it is now releasing an updated version powered by Gemini that it claims can beat 85 per cent of human coders.
