LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents

Jae-Woo Choi1*, Youngwoo Yoon1*, Hyobin Ong1, 2, Jaehong Kim1, Minsu Jang1, 2
1Electronics and Telecommunications Research Institute, 2University of Science and Technology
*Equal Contribution
LoTa-Bench Overview Image
LoTa-Bench Overview

Abstract

Large language models (LLMs) have recently received considerable attention as alternative solutions for task planning. However, comparing the performance of language-oriented task planners remains difficult, and the effects of factors such as pre-trained model selection and prompt construction have not been explored in detail. To address this, we propose a benchmark system that automatically quantifies the task-planning performance of home-service embodied agents. Task planners are tested on two pairs of datasets and simulators: 1) ALFRED and AI2-THOR, and 2) an extension of Watch-And-Help and VirtualHome. Using the proposed benchmark system, we perform extensive experiments with various LLMs and prompts, and explore several enhancements of the baseline planner. We expect the proposed benchmark tool to accelerate the development of language-oriented task planners.

Benchmark Suites

To rigorously evaluate LLM-based task planners, we introduce a comprehensive evaluation framework, LoTa-Bench. The framework integrates three key components: a task planner, a dataset, and a simulator. Our baseline task planner leverages the in-context learning capabilities of LLMs. We offer two distinct dataset-simulator pairings: 1) the ALFRED dataset built on the AI2-THOR simulator, and 2) an extended version of the Watch-And-Help (WAH) dataset, named WAH-NL, paired with the VirtualHome simulator. The following images show examples where the LLM-based task planner, using the GPT-3 175B model, successfully planned and executed the desired tasks in the simulator.

Benchmark Suite 1 Image
ALFRED and AI2-THOR: success example
Benchmark Suite 2 Image
WAH-NL and VirtualHome: success example
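To make the baseline concrete, the sketch below shows one way an in-context-learning planner loop of this kind can be implemented: the prompt is built from example plans plus the current instruction, the LLM scores each executable skill, and the highest-scoring skill is appended to the plan and executed. This is a minimal sketch under assumptions, not the exact LoTa-Bench implementation: the model choice (gpt2 as a small stand-in), the skill list, and helper names such as sequence_logprob are illustrative.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sequence_logprob(prompt: str, continuation: str) -> float:
    """Sum of token log-probabilities of `continuation` given `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)   # position t predicts token t+1
    cont_ids = full_ids[0, prompt_len:]
    token_logprobs = log_probs[0, prompt_len - 1:, :].gather(1, cont_ids.unsqueeze(1))
    return token_logprobs.sum().item()

def plan(instruction: str, in_context_examples: str, skills, max_steps=20):
    """Greedily pick the most likely executable skill at each step."""
    prompt = in_context_examples + f"Task description: {instruction}\n"
    steps = []
    for i in range(1, max_steps + 1):
        scores = {s: sequence_logprob(prompt, f"Step {i}: {s}\n") for s in skills}
        best = max(scores, key=scores.get)
        if best == "done":
            break
        steps.append(best)
        prompt += f"Step {i}: {best}\n"
        # In the benchmark, the chosen skill would also be executed in the
        # simulator (AI2-THOR or VirtualHome) before planning the next step.
    return steps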

Base Experiments

We conducted experiments measuring the performance of the baseline LLM-based task planner using the proposed benchmark. We tested various settings, including LLMs of different model classes and sizes, as well as the impact of the number of in-context examples. The following images show the results on ALFRED and WAH-NL for different pre-trained LLMs.

Base Experiment 1 Image
Baseline results on ALFRED
Base Experiment 2 Image
Baseline results on WAH-NL
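For reference, the loop below sketches how a benchmark run of this kind can aggregate a success rate: each episode resets the simulator, executes the planned skills, and checks the goal conditions at the end. The Simulator interface, check_goal_conditions, and the episode fields are hypothetical placeholders rather than the benchmark's actual API.

def evaluate(planner, simulator, episodes):
    """Run the planner on every episode and report the task success rate."""
    successes = 0
    for episode in episodes:
        simulator.reset(episode.scene)            # load the episode's initial scene
        steps = planner(episode.instruction)      # natural-language plan
        for skill in steps:
            if not simulator.execute(skill):      # stop if an action fails
                break
        if simulator.check_goal_conditions(episode.goal):
            successes += 1
    return successes / len(episodes)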

Extensions of Task Planner

The primary merit of the proposed benchmark is that it enables faster and easier validation of new task planners. To demonstrate this, we explore several extensions (or improvements) of the baseline planner and validate them.

In-Context Example Selection

We explored three strategies for selecting in-context examples from the train set: Random Sampling, Task-Specific Sampling, and Semantic Similarity Sampling. Across all model sizes, Semantic Similarity Sampling performed best, followed by Task-Specific Sampling and then Random Sampling.

Extension Experiment 1 Image
Subgoal success rate for different in-context example selection strategies on WAH-NL
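A minimal sketch of Semantic Similarity Sampling is shown below, assuming a sentence-transformers encoder: the current instruction and the training instructions are embedded, and the k most similar training examples are used as in-context examples. The embedding model name and the train_set structure are illustrative assumptions, not the paper's exact setup.

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")     # assumed embedding model

def select_examples(instruction, train_set, k=5):
    """Return the k training examples whose instructions are most similar."""
    query = encoder.encode(instruction, convert_to_tensor=True)
    corpus = encoder.encode([ex["instruction"] for ex in train_set],
                            convert_to_tensor=True)
    scores = util.cos_sim(query, corpus)[0]           # cosine similarities
    top_k = scores.topk(k).indices.tolist()
    return [train_set[i] for i in top_k]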

Fine-tuning on Train Set

We conducted experiments to investigate whether fine-tuning can improve planner performance. We fine-tuned LLaMA 1 models with LoRA on the ALFRED train set and evaluated them in the same ALFRED domain. As shown in the following figure, fine-tuning significantly improved performance.

Extension Experiment 2 Image
Success rate of fine-tuned planners on ALFRED
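The snippet below is a hedged sketch of LoRA fine-tuning with the Hugging Face PEFT library; the base model path, target modules, hyperparameters, and dataset preparation are assumptions for illustration, not the exact recipe behind the results above.

from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("path/to/llama-7b")   # placeholder path
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # LLaMA attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)  # only the LoRA adapters are trainable

args = TrainingArguments(
    output_dir="lora-alfred",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    learning_rate=1e-4,
)
# `train_dataset` would hold tokenized prompt/plan pairs from the ALFRED train
# set, formatted the same way as the planner's prompt at inference time.
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()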

Feedback and Replanning

Adjusting the plan in response to action failures is essential for task planning in the wild. We investigated whether our LLM-based task planner can incorporate feedback from a failed action and replan appropriately. The following figure shows a qualitative example in which the planner succeeded by replanning after a failure.

Extension Experiment 3 Image
Success case of replanning on ALFRED
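As a rough illustration, the loop below sketches one way feedback-driven replanning can work: when the simulator reports that a skill failed, the failure is written back into the prompt so the next selection can take it into account. The prompt format, the failure wording, and select_next_skill are illustrative assumptions, not the paper's exact mechanism.

def plan_with_replanning(select_next_skill, simulator, instruction, max_steps=20):
    """Step-by-step planning that feeds action failures back into the prompt."""
    prompt = f"Task description: {instruction}\n"
    for i in range(1, max_steps + 1):
        skill = select_next_skill(prompt)     # e.g. highest-likelihood skill
        if skill == "done":
            return True
        if simulator.execute(skill):
            prompt += f"Step {i}: {skill}\n"
        else:
            # Record the failure so the LLM can choose a different skill next.
            prompt += f"Step {i}: {skill} (failed)\n"
    return False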

BibTeX

@inproceedings{choi2024lota,
  title={LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents},
  author={Choi, Jae-Woo and Yoon, Youngwoo and Ong, Hyobin and Kim, Jaehong and Jang, Minsu},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024}
}