In a cluttered, open-plan office in Mountain View, California, a tall, slender, wheeled robot has been busy acting as a tour guide and informal office helper, thanks to a major upgrade to Google DeepMind’s language models. The robot uses the latest version of Google’s Gemini large language model both to parse commands and to find its way around.
For example, a person can say, “Find me a place where I can write,” and the robot will dutifully trundle off and lead them to a clean whiteboard somewhere in the building.
Gemini’s ability to handle video as well as text, plus its capacity to ingest large amounts of information in the form of previously recorded video tours of the office, allows it to make sense of its surroundings and navigate correctly when given commands that require common-sense reasoning. The robot pairs Gemini with an algorithm that generates specific actions for it to take, such as turning, in response to commands and what it sees in front of it.
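In broad strokes, that division of labor, with a multimodal model grounding a request in footage it has already seen and a separate component generating actions, might look something like the self-contained sketch below. Everything in it (the query_vlm stub, the keyword matching that stands in for the model, the canned action list) is a hypothetical illustration, not DeepMind’s implementation.

```python
# Hypothetical sketch of the architecture described above: a multimodal model
# grounds a request in frames from a recorded office tour, and a separate
# policy turns the chosen goal into simple actions. The "model" is replaced by
# a keyword match so the example runs on its own.
from dataclasses import dataclass

@dataclass
class TourFrame:
    frame_id: int
    caption: str   # description of what was seen at this point in the tour

@dataclass
class Waypoint:
    frame_id: int
    label: str

def query_vlm(instruction: str, tour: list[TourFrame]) -> Waypoint:
    """Stand-in for a vision-language model: pick the tour frame that best matches."""
    best = max(tour, key=lambda f: sum(w in f.caption for w in instruction.lower().split()))
    return Waypoint(frame_id=best.frame_id, label=best.caption)

def plan_actions(goal: Waypoint, steps_away: int) -> list[str]:
    """Stand-in for the low-level action generator: emit turn/forward/stop commands."""
    return ["turn_toward_goal"] + ["move_forward"] * steps_away + ["stop"]

tour = [
    TourFrame(0, "kitchen with coffee machine"),
    TourFrame(1, "clean whiteboard next to meeting room"),
    TourFrame(2, "row of standing desks"),
]

goal = query_vlm("find me a place where i can write on a whiteboard", tour)
print(f"Navigating to frame {goal.frame_id}: {goal.label}")
print(plan_actions(goal, steps_away=3))
```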
When Gemini was announced in December, Google DeepMind CEO Demis Hassabis told WIRED that the model’s multimodal capabilities could unlock new abilities for robots, adding that the company’s researchers were hard at work testing its robotic potential.
In a new paper outlining the project, the researchers say the robot proved up to 90 percent reliable at navigating, even when given tricky instructions such as “Where did you put your coaster?” DeepMind’s system “significantly improved the naturalness of human-robot interaction and significantly enhanced the robot’s ease of use,” the team writes.
The demo shows large language models reaching out into the physical world to do useful work. Gemini and other chatbots mostly operate within the confines of a web browser or app, though they are increasingly able to handle visual and auditory input, as both Google and OpenAI have demonstrated recently. In May, Hassabis showed off an upgraded version of Gemini capable of making sense of an office layout as seen through a smartphone camera.
Academic and industrial research labs are racing to figure out how language models can be used to enhance robots’ capabilities. The program for the International Conference on Robotics and Automation, a popular event for roboticists, lists about 20 papers that involve the use of visual language models.
Investors are pouring money into startups that aim to apply advances in AI to robotics. Several of the researchers involved in the Google project have since left the company to found a startup called Physical Intelligence, which received $70 million in early funding and is working to combine large language models with real-world training to give robots general problem-solving abilities. Skild AI, founded by roboticists at Carnegie Mellon University, has a similar goal; it announced a $300 million funding round this month.
Just a few years ago, getting a robot to navigate successfully required mapping its surroundings in advance and choosing commands carefully. Large language models contain useful information about the physical world, and newer versions trained on images and video as well as text, known as visual language models, can answer questions that require perception. Gemini allows Google’s robot to parse visual as well as spoken instructions, for example following a route to a new destination sketched on a whiteboard.
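For a sense of what answering a question that requires perception looks like in practice, here is a rough sketch using Google’s publicly available Gemini Python SDK. It is illustrative only: it is not the robot system described in the paper, and the model name, API key, and image file are assumptions.

```python
# Illustrative sketch: asking a multimodal model a perception question about an
# image via the public Gemini SDK. Not the robot system from the paper; the
# model name, API key, and image file below are placeholders.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")            # placeholder key
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

sketch = Image.open("whiteboard_route.jpg")        # e.g. a route drawn on a whiteboard
response = model.generate_content(
    [sketch, "Where does this sketched route lead, and what is the first turn?"]
)
print(response.text)
```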
The researchers say in their paper that they plan to test the system with different kinds of robots, adding that Gemini should also be able to understand more complex questions, such as “Do you have my favorite drink today?” from a user who has a bunch of empty Coca-Cola cans on their desk.