Our Splitblog in March – DeepMind Gemini 1.5

Today we want to focus on the new AI model from Google. It is a multimodal AI model that can process various types of information, such as text, images, program code, and audio, as well as combinations of them.

A topic suggestion from our developer Mats, who is primarily responsible for the development of our chatbot Kosmo.

A few weeks ago, Google introduced DeepMind Gemini 1.5 – an update to its previous AI models.

The amount of data that Gemini 1.5 can process is particularly groundbreaking. Up to 1 million tokens fit in the context window, and in internal experiments the amount was increased to as much as 10 million tokens. A token is the basic unit into which text is divided so that the model can process it; in practice, a token is a short group of characters, often part of a word. For comparison: GPT-4 Turbo can process 128,000 tokens (as of December 2023), which is roughly equivalent to a 300-page book. If more pages were provided, the model would no longer be able to access the information on the first pages. Figuratively speaking, by the end of the book it would no longer know the author's name.
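
To make the idea of tokens a bit more concrete, here is a minimal Python sketch using OpenAI's tiktoken library. Gemini uses its own tokenizer, so the exact token counts differ; this is only an illustration of how text gets split into tokens:

```python
# Illustration only: counting tokens with OpenAI's tiktoken library.
# Gemini 1.5 uses its own tokenizer, so its counts will differ.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4 models

text = "Gemini 1.5 can hold up to one million tokens in its context window."
tokens = encoding.encode(text)

print(f"Number of tokens: {len(tokens)}")
# Each token maps back to a short group of characters:
print([encoding.decode([t]) for t in tokens])
```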

Gemini 1.5 can capture and analyze up to one hour of video, eleven hours of audio recordings, text with up to 700,000 words, or 30,000 lines of code. Even more impressively, it can “remember” this content and connect it with new information.

During the presentation of the new model, Gemini 1.5 was tasked with analyzing the 402-page transcript of the Apollo 11 mission and finding three humorous passages in it. The model indeed identified three entertaining moments within about 30 seconds. For example, Command Module Pilot Michael Collins said at one point: “The Tsar is brushing his teeth, so I’m stepping in for him.”

Without further information, the researchers then uploaded a hand-drawn sketch of a boot and asked which moment of the mission was shown in the picture. The answer came promptly: “One small step for man, one giant leap for mankind.” Gemini 1.5 can therefore establish complex relationships and reproduce them correctly without explicit instructions.

The architecture of the model is also advanced. It is no longer a single, monolithic model, but a collection of smaller, specialized transformer models. This type of architecture is called Mixture of Experts (MoE). Each of these transformer models is, so to speak, an expert in its field, able to handle certain data segments or particular tasks. Based on the incoming data, the most suitable expert is dynamically selected: for different inputs, different sub-networks of the model are activated to produce the appropriate outputs.

This approach greatly increases the model's efficiency and the quality of the results it delivers.
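
To illustrate the basic routing idea, here is a minimal MoE-style sketch in Python/NumPy. The dimensions, the number of experts, and the gating are deliberately simplified assumptions and say nothing about the internals of Gemini 1.5:

```python
# Minimal sketch of Mixture-of-Experts routing with NumPy.
# Shapes, number of experts, and top-k value are illustrative assumptions,
# not details of Gemini 1.5's actual architecture.
import numpy as np

rng = np.random.default_rng(0)

d_model = 16      # size of a token representation
num_experts = 4   # number of specialized expert networks
top_k = 2         # how many experts are consulted per token

# Each "expert" is reduced here to a single weight matrix.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(num_experts)]

# The gating network scores how well each expert fits the input.
gate_weights = rng.normal(size=(d_model, num_experts))

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route a token vector x through the top-k most suitable experts."""
    scores = x @ gate_weights                 # one score per expert
    top = np.argsort(scores)[-top_k:]         # pick the best-scoring experts
    weights = np.exp(scores[top])
    weights /= weights.sum()                  # softmax over the chosen experts
    # Combine the selected experts' outputs, weighted by the gate.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.normal(size=d_model)
output = moe_layer(token)
print(output.shape)  # (16,) – same shape as the input, but only 2 of 4 experts ran
```

The key point of the sketch: only a fraction of the model's parameters is used for any given input, which is where the efficiency gain comes from.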

Gemini 1.5 is currently only available to selected corporate customers and developers. We are excited to see how it develops further.