Recent works have shown how the reasoning capabilities of Large Language Models (LLMs) can be applied to domains beyond natural language processing, such as planning and interaction for robotics. These embodied problems require an agent to understand many semantic aspects of the world: the repertoire of skills available, how these skills influence the world, and how changes to the world map back to language. LLMs planning in embodied environments need to consider not just what skills to do, but also how and when to do them - answers that change over time in response to the agent's own choices. In this work, we investigate to what extent LLMs used in such embodied contexts can reason over sources of feedback provided through natural language, without any additional training. We propose that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios. We investigate a variety of sources of feedback, such as success detection, object recognition, scene description, and human interaction. We find that closed-loop language feedback significantly improves high-level instruction completion on three domains, including simulated and real tabletop rearrangement tasks and long-horizon mobile manipulation tasks in a real kitchen environment.

Prior works have shown that large language models (LLMs) demonstrate impressive planning capabilities for long-horizon embodied tasks. However, this interaction has remained one-directional: the LLM blindly influences the agent and the environment, but no feedback is routed back to the LLM. The issue is particularly prominent when an intermediate action fails during execution, because the LLM is never informed of the failure. In this work, we formulate an inner monologue by continually adding information from various sources of feedback into the language model prompts. While any textual feedback can be incorporated, we focus our studies on three types of feedback: passive scene description, active scene description, and success detection. Passive scene description covers any feedback that is consistently provided in a structured form, such as object recognition results. Active scene description, on the other hand, covers free-form questions that the LLM may ask and the corresponding unstructured answers provided by a learned model (e.g., a visual question answering model); this channel can also be repurposed to inject human preferences during plan generation. Success detection refers to binary feedback that indicates whether the last action was successful, which is particularly useful in many long-horizon settings.
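To make the closed loop concrete, below is a minimal Python sketch of how such a monologue might be assembled. The helper names (`llm_complete`, `execute_skill`, `detect_success`, `describe_scene`) and the prompt format are hypothetical stand-ins, not the paper's actual implementation.

```python
# Minimal sketch of a closed-loop "inner monologue" planner.
# All helpers are passed in as arguments and are assumed, not real APIs:
#   llm_complete(prompt)  -> next action as text
#   execute_skill(action) -> runs a low-level robot skill
#   detect_success()      -> binary success feedback for the last action
#   describe_scene()      -> structured scene description (e.g., detected objects)

def inner_monologue(instruction, llm_complete, execute_skill,
                    detect_success, describe_scene, max_steps=20):
    # The prompt accumulates the full history of actions and feedback,
    # so each new LLM query is conditioned on everything observed so far.
    prompt = f"Human: {instruction}\n"
    for _ in range(max_steps):
        prompt += "Robot action: "
        action = llm_complete(prompt)          # e.g., "pick up the red block"
        if action.strip() == "done":
            break
        prompt += action + "\n"
        execute_skill(action)
        # Success detection: binary feedback about the last action.
        prompt += f"Success: {detect_success()}\n"
        # Passive scene description: structured feedback injected every step.
        prompt += f"Scene: {describe_scene()}\n"
    return prompt
```

Because the prompt grows with each step, the next action is always chosen in light of the accumulated feedback; this is the closed-loop behavior described above.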
In order to study how different sources of environment feedback can support a rich inner monologue that enables complex robotic control, we analyze diverse long-horizon manipulation and navigation tasks in simulation and in the real world. As Inner Monologue is not dependent on a specific LLM or a particular type of grounding feedback, we study different Inner Monologue implementations in three environments, with different LLM planning methods and different sources of feedback from the environment. For more details about the experiments, the implementations, and the prompts used for the LLM in each domain, please refer to the paper and its appendix.

Given an unseen task instruction, we show that LLMs can not only generate sensible action plans, as observed in previous works, but can also incorporate injected textual feedback from success detection and passive scene description. The video below shows one instantiation of using passive scene description as feedback (Scene). Specifically, the LLM first infers the desired sub-tasks given the high-level instruction. Then, the scene description keeps track of the achieved sub-tasks after each step. Additionally, the LLM generates chain-of-thought text about what remains to be achieved after each step. We demonstrate that this can elicit complex replanning behaviors in tasks that involve combinatorial state spaces (e.g., "put all blocks in bowls with matching colors", "stack all the blocks").

Although LLMs can generate fluent continuations from the prompted examples, we surprisingly find that, when informed with environment feedback, Inner Monologue demonstrates many impressive reasoning and replanning behaviors beyond the examples given in the prompt. Using a pre-trained LLM as the backbone, the method also inherits many appealing properties from the LLM's versatility and general-purpose language understanding. In this section, we demonstrate a few of these emergent capabilities. Although not explicitly prompted, the LLM planner can react to human interaction that changes the high-level goal mid-task. Below we show a challenging case, where human feedback changes the goal during plan execution, and then changes the goal yet again by saying "finish the previous task".
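As a purely illustrative example (not a transcript from the paper, and with made-up field labels), a monologue that combines passive scene description with mid-task goal changes might look like:

```
Human: put all blocks in bowls with matching colors
Robot action: pick up the blue block and place it in the blue bowl
Scene: blue block is in the blue bowl; red block is on the table
Human: actually, stack all the blocks instead
Robot thought: the goal has changed; the blocks should now be stacked
Robot action: pick up the red block and place it on the blue block
Human: finish the previous task
Robot thought: I should resume putting blocks in bowls with matching colors
...
```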