One standout presentation at this year’s Google I/O was Project Astra, a groundbreaking multimodal AI assistant. Demis Hassabis, CEO of Google DeepMind, described it as an “all-seeing AI cloud that lives in your glasses,” a concept he has envisioned for a long time.
Astra is a real-time AI assistant that works across modalities: it understands context and responds conversationally to text, audio, and video input. Powered by Google’s Gemini 1.5 model, it continuously analyzes video and speech to make sense of its surroundings. During the demonstration, a Google employee put Astra through its paces with nothing but a smartphone camera. The assistant identified objects in the room, explained specific code segments shown on a screen, recognized the King’s Cross area of London just by looking out the window, and even generated creative names for animals.
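Astra itself is not available to developers, but the kind of multimodal query shown in the demo, a camera frame plus a question about it, can already be approximated with the public Gemini API. Here is a minimal sketch using the google-generativeai Python SDK; the API key, image file, and question are placeholders:

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# Gemini 1.5 accepts interleaved image and text parts in a single request.
model = genai.GenerativeModel("gemini-1.5-pro")

# e.g. a frame grabbed from the phone camera (placeholder file)
frame = Image.open("camera_frame.jpg")

response = model.generate_content(
    [frame, "What neighbourhood does this view suggest I'm in?"]
)
print(response.text)
```

This is a single request-response round trip; the continuous, low-latency video understanding shown in the Astra demo goes well beyond what the public API exposes today.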
Google demonstrated Astra running on both smartphones and prototype smart glasses, hinting at a possible Gemini-powered overhaul of Google Lens in the future.
Under the hood, Google says Astra processes information faster by continuously encoding video frames and combining the video and speech input into a timeline of events that can be recalled efficiently.
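Google has not published Astra’s internals, so the following is purely a toy sketch of what “combining video and speech input into a timeline of events” might mean: timestamped events from two streams merged into one chronological sequence. All names here (Event, merge_timeline, the payloads) are hypothetical:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Event:
    timestamp: float                     # seconds since session start
    kind: str = field(compare=False)     # "frame" or "speech"
    payload: str = field(compare=False)  # frame encoding id / transcript text

def merge_timeline(frame_events, speech_events):
    """Merge two time-ordered streams into one chronological timeline."""
    return list(heapq.merge(frame_events, speech_events))

frames = [Event(0.0, "frame", "enc_0001"), Event(0.5, "frame", "enc_0002")]
speech = [Event(0.3, "speech", "what am I looking at?")]

for ev in merge_timeline(frames, speech):
    print(f"{ev.timestamp:4.1f}s  {ev.kind:6s}  {ev.payload}")
```

The point of such a structure is that the model can answer questions like “where did I leave my glasses?” by scanning back through a compact, time-ordered record rather than re-processing raw video.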
Additionally, Google has improved the assistant’s speech output to sound more natural and offers users a range of voices. Hassabis emphasized that the goal is a “universal assistant” that helps with everyday tasks.
In another intriguing demonstration, Project Astra “watched” the keynote alongside an employee, suggesting potential desktop integration down the line. Google plans to bring Astra’s capabilities to the Gemini app and other products later this year; there is no specific launch date yet, but the company is optimistic about shipping these features.
CEO Sundar Pichai called Astra the company’s “vision for the future of AI assistants,” underscoring Google’s commitment to its development. Astra was not the only model news: Veo, a new model, generates video from simple text prompts, and Gemini Nano, the variant designed to run locally on devices such as smartphones, is reportedly faster.
Gemini 1.5 Pro’s context window, the amount of information the model can process in a single query, has doubled to 2 million tokens, and Google claims improved performance across the board. The company is making rapid progress both on the models themselves and on getting them into the hands of users.
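For a sense of scale, the public Gemini Python SDK can report how much of that context window a given prompt would consume. A small sketch, with the input file as a placeholder:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-pro")

# count_tokens reports how many tokens a prompt would occupy
# without actually running generation.
document = open("long_report.txt").read()  # placeholder input
usage = model.count_tokens(document)
print(f"{usage.total_tokens} of ~2,000,000 tokens used")
```

Two million tokens is enough to hold hours of video transcript or several long books in a single query, which is what makes assistant-style features like Astra’s running visual memory plausible.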