Google recently unveiled Gemini, its latest AI model, with company insiders indicating that the Pro version outperforms GPT-3.5 (the model behind the free version of ChatGPT) in comprehensive testing. Performance assessments show that the flagship model, Gemini Ultra, surpasses existing state-of-the-art results on 30 of 32 widely recognized academic benchmarks used in large language model (LLM) research and development. Although Google has been perceived as trailing OpenAI's ChatGPT, widely considered the most popular and powerful offering in the AI domain, the company asserts that Gemini is designed to be multimodal: it can process diverse forms of media, including text, images, video, and audio.
Insider reports reveal that Gemini Ultra has achieved a groundbreaking score of 90.0%, marking the first instance of a model surpassing human experts on the Massive Multitask Language Understanding (MMLU) benchmark. MMLU spans 57 subjects, including mathematics, physics, history, law, medicine, and ethics, and tests both world knowledge and problem-solving skills.
Gemini is available in three sizes: Ultra (the flagship model), Pro, and Nano (optimized for mobile devices). According to TechCrunch, Google plans to make Gemini Pro accessible to enterprise customers through its Vertex AI program and to developers in AI Studio starting December 13. Additionally, reports suggest that the Pro version can be accessed through Bard, the company's chatbot interface.
Eli Collins, VP of product at DeepMind (the Google division responsible for building out the AI platform), told TechCrunch that Gemini Ultra can comprehend "nuanced" information across text, images, audio, and code.
Collins also noted that while a portion of the training data is sourced from the public web, the company would not address the specifics of its training data sources.
When assessing a large language model, people often focus on its parameter count as a key metric. Essentially, parameters are the numeric variables that encode a model's acquired knowledge, enabling it to predict and generate text from input. Generally, a higher parameter count implies greater potential for diverse and accurate outputs, but it also demands more computational resources and memory for training and inference. GPT-4 is rumored to have roughly one trillion parameters, about six times GPT-3.5's 175 billion, which would make it one of the largest language models ever created.
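To make the notion of a parameter concrete, here is a minimal sketch using a toy fully connected network (not the architecture of any real GPT or Gemini model) showing how the count grows with layer sizes:

```python
def param_count(layer_sizes):
    """Parameters of a toy fully connected network: each layer has a
    weight matrix (n_in x n_out) plus one bias per output unit."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# A tiny network with 4 inputs, 8 hidden units, and 2 outputs:
print(param_count([4, 8, 2]))  # (4+1)*8 + (8+1)*2 = 58
```

Scaling those layer widths into the tens of thousands across dozens of layers is how frontier models reach billions of parameters, and why their training and serving costs grow so quickly.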
Concerning model sizes, the names Gecko, Otter, Bison, and Unicorn belong to Google's earlier PaLM 2 family, with Unicorn the largest; Gemini itself ships in the three sizes noted above (Nano, Pro, and Ultra). While exact parameter counts are undisclosed, hints suggest the largest Gemini model is likely comparable to GPT-4, if not slightly smaller. Notably, Gemini stands out for its interactivity and creativity compared with other LLMs. It can produce outputs in various modalities based on user preferences and generate novel, diverse content that is not restricted to existing data or templates. For instance, Gemini can generate original images or videos from text descriptions or sketches, or create stories and poems based on images or audio clips. Now, let's delve into how Gemini, while not necessarily outsmarting GPT-4, excels at more varied and extended tasks. Here are a few examples, starting with multi-modal question answering.
Gemini tackles multi-modal questions involving various data types such as text and images, for example identifying the author from a book cover image or naming an animal from a picture. It excels at multi-modal summarization, condensing diverse inputs like text and audio into short summaries. Gemini also handles multi-modal translation, generating subtitles for videos or dubbing them into other languages using its textual and visual translation skills. The system extends these capabilities to multi-modal generation, producing content such as images from text descriptions or text from images.
Gemini's standout feature is multi-modal reasoning, allowing it to answer complex questions about, for instance, a movie's main theme by synthesizing information from various modalities. This ability enables it to discern patterns, understand character interactions, and uncover hidden messages in films, providing a comprehensive understanding.
The technology's potential goes beyond what can be covered in this blog, showcasing its power and versatility. Looking ahead, Google's multi-modal approach with Gemini is poised to challenge GPT-4 and possibly GPT-5, with further applications foreseeable in personalized assistants and creative tools. This suggests a future where Gemini's capabilities enhance user experiences and offer innovative solutions across diverse modalities.