LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models

1University of California, Berkeley, 2Google DeepMind
Teaser figure: sample tasks from the LMRL-Gym benchmark for the development of RL algorithms for language.

Abstract

Our paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for LLMs, together with an open-source research framework containing a basic toolkit for getting started on multi-turn RL with offline value-based and policy-based RL methods. The motivation behind our work is particularly apparent in multi-turn conversations: even the best current LLMs rarely ask clarifying questions, engage in explicit information gathering, or take actions now that lead to better decisions after multiple turns. Reinforcement learning has the potential to leverage the powerful modeling capabilities of LLMs, as well as their internal representation of textual interactions, to create capable goal-directed language agents. This can enable intentional and temporally extended interactions, such as with humans, through coordinated persuasion and carefully crafted questions, or in goal-directed play through text games to bring about desired final outcomes. Our contributions are:

  • Eight different language tasks that require multiple rounds of language interaction, spanning open-ended dialogue and text games. Each task has both an online simulator and offline datasets available for training (a minimal interaction sketch follows this list).
  • A stable and reliable research framework with reinforcement learning algorithms that can effectively train LLMs.
  • A methodology for synthetic data generation to create further datasets for RL-LLM algorithm development.
  • Task designs that isolate specific RL capabilities to test the success of algorithms.
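
As a rough illustration of how a task's online simulator and offline datasets might be used, the sketch below shows a generic multi-turn rollout loop and an offline training loop. All names in it (rollout, env, policy, load_offline_dataset, update_policy) are hypothetical placeholders for illustration only, not the framework's actual API.

# Minimal sketch of multi-turn interaction with a text environment.
# All names here are hypothetical placeholders, not the LMRL-Gym API.

from typing import List, Tuple

def rollout(env, policy, max_turns: int = 20) -> Tuple[List[Tuple[str, str]], float]:
    """Collect one online trajectory by alternating agent utterances and environment responses."""
    history: List[Tuple[str, str]] = []
    observation = env.reset()                     # e.g., the game's opening prompt
    total_reward = 0.0
    for _ in range(max_turns):
        action = policy(observation, history)     # agent's next utterance
        observation, reward, done = env.step(action)
        history.append((action, observation))
        total_reward += reward
        if done:
            break
    return history, total_reward

# Offline RL methods instead train directly on logged trajectories, e.g.:
#   dataset = load_offline_dataset("twenty_questions")   # list of (trajectory, return) pairs
#   for trajectory, ret in dataset:
#       update_policy(policy, trajectory, ret)           # value-based or policy-based update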

Evaluating Capabilities Enabled by RL

A central objective of our benchmark is to evaluate the core capabilities that RL can enable in large language models. Some of these capabilities are computational, relating to core decision making irrespective of natural language (such as playing chess), while others are semantic. These include:

  • Strategic decision making: RL shines in goal-directed tasks that require multi-step planning and strategic decision making, ranging from asking follow-up questions to gather information (e.g., in the 20 Questions task) to executing complex strategy in chess.
  • Complex language: Our benchmark includes realistic language and interaction scenarios, with the aim of requiring LLMs to combine knowledge from pretraining with task-specific patterns learned during finetuning.
  • Credit assignment: Because rewards are often delayed relative to the action that was pivotal to the outcome, we evaluate the ability of RL algorithms to identify the actions and trajectories that lead to good outcomes and reinforce them.
  • Partial observability: For language tasks, the state consists of the entire history of tokens, and an agent may need to examine this entire context to infer the correct state. This is tested by requiring algorithms to infer the mental state of the speaker (e.g., whether the buyer is impatient in a selling task) or to track previously observed facts in a guessing game.
  • Trajectory stitching: Algorithms must learn to combine optimal actions from different suboptimal trajectories to form a better overall trajectory.

Tasks in LMRL-Gym Benchmark

Figure: sample trials from each task in the benchmark. Each task requires the agent to perform a multi-turn interaction with an environment -- either a text game or another LLM simulating a human speaker.



To generate data for the conversational tasks, we use LLMs as "simulators": they generate the offline data, provide a simulation environment for evaluation and online training, and compute rewards.
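
To make this concrete, here is a minimal sketch of how an LLM-based simulator could be used to roll out a synthetic conversation and assign a reward. The helper names (agent_llm.generate, simulator_llm.generate, reward_fn) are hypothetical and stand in for whatever models and reward checks a given task uses; this is an illustration, not the paper's actual data-generation code.

# Illustrative sketch: one LLM plays the agent, another simulates the human
# speaker or environment, and a task-specific reward function scores the outcome.
# All names below are hypothetical placeholders.

def simulate_conversation(agent_llm, simulator_llm, task_prompt, reward_fn, max_turns=10):
    """Roll out one synthetic multi-turn conversation and compute its reward."""
    transcript = [("system", task_prompt)]
    for _ in range(max_turns):
        agent_msg = agent_llm.generate(transcript)        # agent's turn
        transcript.append(("agent", agent_msg))
        env_msg = simulator_llm.generate(transcript)      # simulated speaker's turn
        transcript.append(("environment", env_msg))
        reward = reward_fn(transcript)                    # e.g., 1.0 once the goal is reached
        if reward > 0:
            return transcript, reward
    return transcript, 0.0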

Our Results

We evaluate our tasks on a set of both online and offline RL algorithms. To make the results more comparable across tasks, we normalize the average return for each policy such that 0 is the minimum possible return, 50 is the dataset average return, and 100 is the maximum return for each task. We also report the raw score results and evaluation details in our Appendix.
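
Under one natural reading of this scheme (a piecewise-linear mapping between the three anchor points, which is our assumption; the text only fixes the anchors), the normalization could be computed as follows:

# Sketch of the score normalization: minimum return -> 0, dataset average -> 50,
# maximum return -> 100, assuming piecewise-linear interpolation between anchors.

def normalize_return(avg_return, min_return, dataset_avg_return, max_return):
    """Map a policy's average return onto the 0/50/100 scale described above."""
    if avg_return <= dataset_avg_return:
        return 50.0 * (avg_return - min_return) / (dataset_avg_return - min_return)
    return 50.0 + 50.0 * (avg_return - dataset_avg_return) / (max_return - dataset_avg_return)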

BibTeX

@article{abdulhai2023lmrl,
  author    = {Abdulhai, Marwa and White, Isadora and Snell, Charlie and Sun, Charles and Hong, Joey and Zhai, Yuexiang and Xu, Kelvin and Levine, Sergey},
  title     = {LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models},
  year      = {2023},
}