TL;DR: Imitating Interactive Intelligence

Google DeepMind recently put a new 96-page paper on arXiv titled Imitating Interactive Intelligence. The paper has a few really cool things to contribute; unfortunately, owing to its intimidating length, those insights aren't easy to extract. DeepMind have put up their own blog post on the work, but I didn't really get a sense of what happened just by reading that post. They also released three videos showing what happened in the project (an overview, a training timelapse, and a demo). This write-up will summarize:

  1. What the paper is trying to do,

  2. What it actually does,

  3. Whether their approach works and its insights are accurate, and

  4. What machine learning/robotics researchers can take away from the paper.

What is the paper trying to do?

This paper is targeted at developing agents that have “interactive intelligence.” Effectively, if we have a robot in a child's bedroom and we tell it to “Put the toy helicopter back on the shelf,” can it understand that command and execute it? If we ask the robot “What color is the notebook on the desk?” can it answer the question? If we want the robot to coordinate other agents in a cleanup task, can it generate its own sub-commands, i.e., specific instances of “Tell people to put things away”?

DeepMind extend their prior success in imitation learning (IL) and reinforcement learning (RL) to language, vision, and task learning in their new Playroom Unity environment (shown below), an interactive environment for simple task completion. After gathering roughly 2 years of human interaction data, the authors ask: can an agent learn to understand and respond to language commands in the Playroom, and, more broadly, learn to be an interactive, intelligent agent there, by imitating those 2 years of expert data?


Figure 1 from the paper itself. Panel A shows a closeup of a Playroom, with two agents looking at a toy helicopter. Panel B shows four examples of the dynamic, random configurations of the Playroom. Panel C shows examples of various objects which could be randomly distributed in a Playroom. Image source: https://arxiv.org/pdf/2012.05672.pdf

What do they do?

The authors compare several training schemes, loss functions, model architectures, dataset sizes, and success metrics to see what works best for learning interactive intelligence. Ultimately, they surface several interesting problems (e.g., how do we know when tasks are successfully assigned or completed without having humans label every episode?), empirically show that balancing the imitation, RL, and auxiliary objectives matters, and produce a tangible, end-to-end system that takes a concrete step towards imitating interactive intelligence from a large human dataset.

Importantly, there is no actual robot deployment (everything is in simulation), the entire dataset and task suite are confined to the Playroom, and unfortunately there isn't much shared with the community outside of the paper (no code, environment, or data).

Does it actually work?

The authors' approach does seem to do quite well! Owing to their thorough ablations and sweeps over architectures, training procedures, dataset sizes, and more, we can be reasonably confident that their approach would work well for learning situated intelligence in Playroom-like environments (given their assumptions of constrained vocabularies, action spaces, etc.).

What can researchers take away from this paper?

1) Use Imitation Learning + Reinforcement Learning + Representation Learning

Following DeepMind's previous IL+RL success in AlphaStar and related recent works from others at CVPR, UAI, and more, Imitating Interactive Intelligence adds evidence for the success of IL + RL for everything from task learning to language grounding, generalization, and transfer learning. The authors contribute an in-depth comparison of different techniques that can be combined to learn this kind of interactive intelligence. For interactive task completion, question-answering, and producing language commands, the authors conclude that the best way to learn is to combine behavior cloning with RL and a few auxiliary tasks for representation learning (such as “object in scene prediction” and “goal-trajectory matching”).
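To make that concrete, here is a minimal sketch of how a combined objective like this might be wired up. The loss weights, the binary “object in scene” auxiliary head, and the simple policy-gradient surrogate are illustrative assumptions on my part, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def combined_loss(policy_logits, expert_actions,
                  log_probs, advantages,
                  aux_logits, aux_labels,
                  w_bc=1.0, w_rl=0.5, w_aux=0.1):
    """Illustrative weighted sum of behavior cloning, an RL surrogate, and an
    auxiliary representation-learning loss. Weights are made up, not the paper's."""
    # Behavior cloning: cross-entropy against the expert's chosen action.
    bc_loss = F.cross_entropy(policy_logits, expert_actions)
    # RL term: a simple policy-gradient surrogate using precomputed advantages.
    rl_loss = -(log_probs * advantages.detach()).mean()
    # Auxiliary task, e.g. "is this object in the scene?" as binary classification.
    aux_loss = F.binary_cross_entropy_with_logits(aux_logits, aux_labels)
    return w_bc * bc_loss + w_rl * rl_loss + w_aux * aux_loss

# Toy usage with random tensors, just to show the shapes involved.
B, A = 8, 12  # batch size, number of discrete actions
loss = combined_loss(
    policy_logits=torch.randn(B, A, requires_grad=True),
    expert_actions=torch.randint(0, A, (B,)),
    log_probs=torch.randn(B, requires_grad=True),
    advantages=torch.randn(B),
    aux_logits=torch.randn(B, 1, requires_grad=True),
    aux_labels=torch.randint(0, 2, (B, 1)).float(),
)
loss.backward()
```

The point of the sketch is just the structure: one gradient step can mix imitation, reward-driven learning, and representation-shaping signals, and the paper's ablations are about which of these terms (and in what balance) actually matter.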

2) Use Transformers for Multi-Modal Scene Embeddings

This paper is also the latest in a long line of work showing just how good transformers are. The authors use a ResNet to embed images into flat vectors, then pass the image and word embeddings into a multi-modal transformer to obtain scene embeddings. These scene embeddings then go into an LSTM, whose hidden states feed the various policy heads (as shown in the figure from the paper below). Nothing here is particularly surprising or new, but the authors do a very exhaustive comparison of different architectures, and it's worth noting that, once again, transformers are just so good at learning scene embeddings.


Figure 5 in the original paper. Images are embedded using ResNet, words are tokenized into a 500-word vocabulary, and then these embeddings all go to a transformer. Transformer tokens go into an LSTM, which produces the hidden states for the various policy heads. Image source: https://arxiv.org/pdf/2012.05672.pdf
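As a rough picture of how such a pipeline fits together, here is a minimal sketch using standard PyTorch/torchvision components. The layer sizes, transformer depth, token pooling, action space, and head layout are all assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SceneEncoderPolicy(nn.Module):
    """Illustrative ResNet -> multi-modal transformer -> LSTM -> policy-heads stack.
    Sizes, layer counts, and head layout are placeholders, not the paper's settings."""

    def __init__(self, vocab_size=500, d_model=256, n_actions=12):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, d_model)  # image -> flat vector
        self.image_encoder = backbone
        self.word_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.action_head = nn.Linear(d_model, n_actions)      # movement/interaction policy
        self.language_head = nn.Linear(d_model, vocab_size)   # word-emission (language) policy

    def forward(self, image, word_ids, state=None):
        # image: (B, 3, H, W); word_ids: (B, T) indices into the 500-word vocabulary.
        img_tok = self.image_encoder(image).unsqueeze(1)              # (B, 1, d_model)
        txt_tok = self.word_embed(word_ids)                           # (B, T, d_model)
        tokens = self.transformer(torch.cat([img_tok, txt_tok], 1))   # multi-modal attention
        scene = tokens.mean(dim=1, keepdim=True)                      # pool to one scene embedding (an assumption)
        out, state = self.lstm(scene, state)                          # LSTM carries memory across timesteps
        h = out[:, -1]                                                # hidden state for the policy heads
        return self.action_head(h), self.language_head(h), state

# Toy forward pass for a single timestep with random inputs.
model = SceneEncoderPolicy()
action_logits, word_logits, _ = model(torch.randn(2, 3, 96, 96),
                                      torch.randint(0, 500, (2, 6)))
```

The design idea the figure conveys is the shared multi-modal scene embedding: image and word tokens pass through one transformer, and everything downstream (the LSTM memory and the separate action/language heads) consumes that same representation.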

While the authors do not share their dataset, code, environments, or learned models, a tremendous amount of space in the paper is dedicated to explaining the problem setup, the data collection process, the annotation process, and very fine-grained implementation details in the appendix. For anyone looking to get into reinforcement learning, there are a myriad of design decisions, little bugs to track down, and headaches to overcome, and the appendix here covers all of it. It's one of my favorite parts of the paper, simply because it documents all of the things I usually want to write down for my own research.

Closing Thoughts & Discussion

Last year, DeepMind released AlphaStar, an AI StarCraft II player that could compete at a professional level in a very complex real-time strategy game. While DeepMind had success in Go and chess using reinforcement learning alone, it turns out that reinforcement learning isn't enough for extremely complex tasks like StarCraft II: their StarCraft II agent learned by copying the strategies of millions of human players while also learning through exploration and self-play. This work extends that line of thinking, effectively trying to learn the very complex “game” of interactive intelligence by imitating (and running supervised learning over) human interactions and then applying reinforcement learning on top. What's particularly cool here is that the reward for interactive intelligence isn't well-defined, so the authors apply GAIL (generative adversarial imitation learning) to dynamically estimate a reward function for their agent.
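For readers unfamiliar with GAIL (Ho and Ermon, 2016), the core idea is to train a discriminator that tells expert state-action pairs apart from the agent's own, and then use the discriminator's output as the reward for the RL update. Here is a minimal sketch of that idea; the network sizes and the particular reward transform are illustrative choices, not necessarily what the paper uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """Scores (state, action) pairs; higher logits mean 'looks like the expert'."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

def discriminator_loss(disc, expert_s, expert_a, agent_s, agent_a):
    # Expert pairs are labeled 1, the agent's own pairs are labeled 0.
    expert_logits = disc(expert_s, expert_a)
    agent_logits = disc(agent_s, agent_a)
    return (F.binary_cross_entropy_with_logits(expert_logits, torch.ones_like(expert_logits))
            + F.binary_cross_entropy_with_logits(agent_logits, torch.zeros_like(agent_logits)))

def gail_reward(disc, state, action):
    # One common choice of reward: r = -log(1 - D(s, a)), which is high when
    # the discriminator thinks the behavior looks expert-like.
    with torch.no_grad():
        d = torch.sigmoid(disc(state, action))
    return -torch.log(1.0 - d + 1e-8)

# Toy usage: per-step rewards that the RL algorithm would then maximize.
disc = Discriminator(state_dim=16, action_dim=4)
rewards = gail_reward(disc, torch.randn(8, 16), torch.randn(8, 4))
```

Because the discriminator keeps updating as the agent improves, the reward signal is effectively a moving target, which is exactly why it's a useful stand-in when nobody can hand-write a reward for "behave like a helpful, interactive agent."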

This work was one of the first I've seen to directly approach NLP from a situated, interactive standpoint. As highlighted in prior work (e.g., Bisk et al.'s “Experience Grounds Language”), trying to learn language just by reading the internet is a weird and probably bad idea. Learning language by interacting with the world, copying other people's behavior, and completing tasks makes much more sense and is closer to how we actually learn. I really like this direction, and I'm excited to see more research with this motivation.

Finally, I want to reiterate how great the reproducibility effort is here. While we don't have code or data, the authors put a tremendous amount of work into documenting everything they tested, all of the hyperparameter choices they made, all of their experiments, and so on. It's great to see, and it will be a great resource for beginner RL researchers.

Works linked in this post and other cool work in the area:

  • Abramson, Josh, et al. "Imitating Interactive Intelligence." arXiv preprint arXiv:2012.05672 (2020).

  • Bisk, Yonatan, et al. "Experience grounds language." arXiv preprint arXiv:2004.10151 (2020).

  • Cheng, Ching-An, et al. "Fast Policy Learning through Imitation and Reinforcement." UAI 2018: 845-855.

  • Ho, Jonathan, and Stefano Ermon. "Generative adversarial imitation learning." Advances in neural information processing systems 29 (2016): 4565-4573.

  • Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.

  • The AlphaStar Team. "AlphaStar: Mastering the Real-Time Strategy Game StarCraft II." DeepMind Blog (2019). https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii

  • Wang, Xin, et al. "Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.

  • Ward, Tom, et al. "Using Unity to Help Solve Intelligence." arXiv preprint arXiv:2011.09294 (2020).

  • Wu, Ga, et al. "Deep language-based critiquing for recommender systems." Proceedings of the 13th ACM Conference on Recommender Systems. 2019.