Google has created a robot brain called PaLM-E that can do many different things when people tell it what to do.

SHORT: Google and some university researchers made a new smart robot brain that can do many things without being taught each one separately. It can see things and understand what they mean, like finding chips in a drawer, and then make a plan to get them. People only give it a simple instruction; the robot figures out the steps on its own. The researchers made it by teaching it with a lot of pictures and words. They hope it can be used to help people at home or work.


LONG: Google and the Technical University of Berlin recently unveiled PaLM-E, a multimodal embodied visual-language model (VLM) with 562 billion parameters that integrates vision and language for robotic control. The researchers claim it is the largest VLM developed to date, and it can perform a variety of tasks without needing to be retrained. Given a high-level command such as "bring me the rice chips from the drawer," PaLM-E can generate a plan of action for a mobile robot platform with an arm developed by Google Robotics and carry it out on its own. The model works directly from the robot's camera data without requiring a pre-processed scene representation, so no human has to annotate the scene first, which allows for more autonomous robotic control.


Under the hood, PaLM-E is a next-token predictor that turns instructions into actions. It takes continuous observations, such as images or sensor data, and encodes them into a sequence of vectors the same size as language-token embeddings, so the model can ingest sensory information the same way it processes language. In that sense PaLM-E is not just a language model but an embodied one, integrating sensory information and robotic control.
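To make that idea concrete, here is a minimal PyTorch-style sketch of how continuous observations can be projected into the same embedding space as language tokens and fed to a next-token predictor. It is not PaLM-E's actual code: the class name, the tiny dimensions, and the flatten-based image encoder are all illustrative assumptions (the real system uses a ViT image encoder and the 562B PaLM backbone).

```python
import torch
import torch.nn as nn

# Illustrative sizes only -- chosen so the sketch runs quickly, not to match PaLM-E.
VOCAB_SIZE = 32000
EMBED_DIM = 512          # width of a language-token embedding
NUM_IMAGE_VECTORS = 16   # how many "visual tokens" one image becomes

class TinyMultimodalLM(nn.Module):
    """Sketch of the PaLM-E idea: continuous observations are encoded into
    vectors with the same width as language-token embeddings, so the model
    can treat them like ordinary tokens and keep doing next-token prediction."""

    def __init__(self):
        super().__init__()
        self.token_embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        # Stand-in image encoder: a real system would use a vision transformer.
        self.image_encoder = nn.Sequential(
            nn.Flatten(),                                        # (B, 3*32*32)
            nn.Linear(3 * 32 * 32, NUM_IMAGE_VECTORS * EMBED_DIM),
        )
        # Small transformer stack standing in for the large PaLM decoder.
        layer = nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=8,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(EMBED_DIM, VOCAB_SIZE)

    def forward(self, image, text_tokens):
        # Encode the camera image into a short sequence of embedding-sized vectors.
        img_vecs = self.image_encoder(image).view(-1, NUM_IMAGE_VECTORS, EMBED_DIM)
        # Embed the text prompt, e.g. "bring me the rice chips from the drawer".
        txt_vecs = self.token_embed(text_tokens)
        # Visual "tokens" and text tokens share one sequence.
        sequence = torch.cat([img_vecs, txt_vecs], dim=1)
        hidden = self.transformer(sequence)
        # Predict the next token of the plan, exactly as a language model would.
        return self.lm_head(hidden[:, -1, :])

model = TinyMultimodalLM()
image = torch.randn(1, 3, 32, 32)                 # camera observation
prompt = torch.randint(0, VOCAB_SIZE, (1, 12))    # tokenized instruction
next_token_logits = model(image, prompt)
print(next_token_logits.shape)                    # torch.Size([1, 32000])
```

The point of the sketch is only that sensory inputs and words end up in one token sequence; in the real system the plan text generated this way is then mapped to the robot's low-level skills.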


PaLM-E is also resilient and can react to its environment. For example, it can guide a robot to fetch a chip bag from a kitchen and remains robust to interruptions that occur during the task. In addition, PaLM-E exhibits "positive transfer": it can carry over knowledge and skills learned from one task to another, giving it significantly higher performance than single-task robot models. PaLM-E also sets a new state of the art on the OK-VQA visual question-answering benchmark.


The researchers have observed several interesting effects that apparently come from using a large language model as the core of PaLM-E. For example, they have observed that "the larger the language model, the more it maintains its language capabilities when training on visual-language and robotics tasks—quantitatively, the 562B PaLM-E model nearly retains all of its language capabilities."


PaLM-E is a significant step forward in the field of multimodal embodied visual-language models and robotic control. Google researchers plan to explore more applications of PaLM-E for real-world scenarios such as home automation or industrial robotics. As deep learning models get more complex over time, the researchers hope that PaLM-E will continue to exhibit new emergent capabilities, further improving its usefulness for robotic control and beyond.
