Preprints
V-STaR: Training Verifiers for Self-Taught Reasoners
Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, Rishabh Agarwal
Self-improvement approaches, such as ReST^EM and STaR, discard all LLM-generated incorrect solutions during training. V-STaR
augments these approaches by training a verifier on both correct and incorrect solutions, which is then used at test time to re-rank LLM generations.
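The re-ranking step above can be sketched in a few lines; `verifier_score` below is a hypothetical stand-in for the trained verifier, not the paper's actual interface.

```python
# Minimal sketch of verifier-based re-ranking at test time (illustrative,
# not the paper's implementation). `verifier_score(problem, solution)` is a
# placeholder for a trained verifier returning a correctness score.

def rerank(problem, candidates, verifier_score):
    """Return candidate solutions sorted by verifier score, best first."""
    return sorted(candidates,
                  key=lambda sol: verifier_score(problem, sol),
                  reverse=True)

# Toy usage with a fake verifier that happens to prefer shorter solutions.
toy_score = lambda p, s: -len(s)
best = rerank("2+2?", ["4 because ...", "4"], toy_score)[0]
```

In practice the verifier would itself be an LLM fine-tuned on both correct and incorrect solutions, and `candidates` would be sampled generations.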
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
Avi Singh*, JD Co-Reyes*, Rishabh Agarwal* et al.
We explore whether we can go beyond human data on tasks where we have access to scalar feedback, finding that
a simple self-training method based on expectation-maximization can substantially reduce dependence on human-generated data.
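The expectation-maximization loop described above can be sketched as follows; `sample`, `is_correct`, and `finetune` are placeholder callables, not the paper's API.

```python
# Hedged sketch of EM-style self-training (ReST^EM-like): generate solutions
# (E-step), keep those that pass the scalar-feedback check, and fine-tune on
# the filtered data (M-step). All callables here are hypothetical stand-ins.

def rest_em(model, problems, sample, is_correct, finetune, iterations=3):
    for _ in range(iterations):
        # E-step: sample candidate solutions from the current model.
        data = [(p, s) for p in problems for s in sample(model, p)]
        # Keep only solutions that receive positive feedback.
        data = [(p, s) for p, s in data if is_correct(p, s)]
        # M-step: fine-tune on model-generated, feedback-verified data.
        model = finetune(model, data)
    return model
```

The key point the paper makes is that this filtered, model-generated data can substitute for much of the human-written data.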
Selected Publications
On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes
Rishabh Agarwal*, Nino Vieillard*, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, Olivier Bachem
ICLR 2024
GKD tackles distribution-mismatch in distilling autoregressive models, and outperforms commonly-used approaches on distilling LLMs for
summarization, translation and reasoning tasks.
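The distribution-mismatch fix in GKD is to train the student on its *own* samples, with the teacher's token distribution as the target. A minimal sketch, assuming hypothetical `teacher`/`student` callables that return next-token probability vectors (this is not the paper's code):

```python
import math

def token_kl(teacher_probs, student_probs):
    """Forward KL(teacher || student) over one token position's vocabulary."""
    return sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_probs) if t > 0)

def on_policy_distill_loss(prompt, student_sample, teacher, student):
    """Average per-token divergence on a student-generated continuation.

    Because `student_sample` comes from the student itself, the loss is
    computed on the distribution the student actually visits at test time.
    """
    loss = 0.0
    for i in range(len(student_sample)):
        prefix = student_sample[:i]
        loss += token_kl(teacher(prompt, prefix), student(prompt, prefix))
    return loss / max(len(student_sample), 1)
```

GKD generalizes this further (e.g., other divergences and mixing on-policy with fixed data); the sketch shows only the on-policy core.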
Bigger, Better, Faster: Human-level Atari with human-level efficiency
Max Schwarzer, Johan Obando-Ceron, Aaron Courville, Marc Bellemare, Pablo Samuel Castro*, Rishabh Agarwal*
ICML 2023
By scaling compute and model size alongside appropriate design choices, value-based methods achieve super-human performance on the Atari 100K benchmark while being 4x more compute-efficient than the prior SOTA.
Offline Q-Learning on Diverse Multi-Task Data Both Scales And Generalizes
Aviral Kumar, Rishabh Agarwal, Xinyang Geng, George Tucker, Sergey Levine
ICLR 2023 (Top 5%), NeurIPS 2022 DRL Workshop Best paper Runner-up
With appropriate design choices, offline Q-learning exhibits strong performance that scales with model capacity. The secret ingredients
were training on a large and diverse offline dataset with ResNets, distributional C51 backups, and feature normalization (that is, making RL training look more like supervised learning).
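One of those ingredients, feature normalization, is simple enough to sketch in isolation: the penultimate-layer features are L2-normalized before the final linear head. (Shapes and names below are illustrative, not the paper's code.)

```python
import numpy as np

def normalize_features(phi, eps=1e-6):
    """L2-normalize each row of a (batch, dim) feature matrix.

    Projecting features onto the unit sphere before the Q-value head was
    found to stabilize offline Q-learning at scale.
    """
    norms = np.linalg.norm(phi, axis=1, keepdims=True)
    return phi / (norms + eps)
```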
Beyond Tabula Rasa: Reincarnating Reinforcement Learning
Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, Marc G. Bellemare
NeurIPS 2022
This work proposes an alternative research workflow to tabula rasa RL, where prior computational work (e.g., learned policies) is transferred from one agent to another.
Deep RL at the Edge of the Statistical Precipice
Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, Marc G. Bellemare
NeurIPS 2021 (Outstanding Paper Award)
Our findings call for a change in how we evaluate performance on deep RL benchmarks, for which we present more reliable protocols and an
open-source library, easily applicable with *even a handful of runs*, to prevent unreliable results
from stagnating the field.
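Two of the protocols advocated in this work, the interquartile mean (IQM) and bootstrap confidence intervals, can be sketched as below. This mirrors the ideas behind the open-sourced `rliable` library but is not its API.

```python
import numpy as np

def iqm(scores):
    """Mean of the middle 50% of scores: more robust than the mean to
    outlier runs, more statistically efficient than the median."""
    s = np.sort(np.asarray(scores).ravel())
    n = len(s)
    return s[n // 4 : n - n // 4].mean()

def bootstrap_ci(scores, stat=iqm, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for an aggregate statistic,
    so that uncertainty is reported even with a handful of runs."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores).ravel()
    stats = [stat(rng.choice(scores, size=len(scores), replace=True))
             for _ in range(n_boot)]
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```

The full protocol also stratifies the bootstrap across tasks and reports performance profiles; this sketch shows only the aggregate-metric core.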
Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning
Rishabh Agarwal, Marlos C. Machado, Pablo Samuel Castro, Marc G. Bellemare
ICLR 2021 (Spotlight)
To improve generalization, we learn representations, via a contrastive loss, that group together states with similar long-term optimal behavior. This is orthogonal to existing
approaches such as data augmentation. An earlier version was accepted as an oral presentation at the NeurIPS 2020 Workshop on Biological and Artificial RL.
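The general shape of such an objective can be sketched with a classic margin-based contrastive loss; this is a hedged illustration of the idea, not the paper's actual PSE loss.

```python
import numpy as np

def pairwise_contrastive_loss(z_a, z_b, similar, margin=1.0):
    """z_a, z_b: (n, d) embeddings of paired states; similar: (n,) bools
    indicating behavioral similarity of each pair.

    Behaviorally-similar pairs are pulled together (distance penalized);
    dissimilar pairs are pushed apart until they clear the margin.
    """
    d = np.linalg.norm(z_a - z_b, axis=1)
    return np.where(similar, d ** 2,
                    np.maximum(0.0, margin - d) ** 2).mean()
```

In the paper, "similar" is defined by a behavioral similarity metric over long-term optimal behavior rather than a binary label.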
Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning
Aviral Kumar*, Rishabh Agarwal*, Dibya Ghosh, Sergey Levine
ICLR 2021
We identify an implicit under-parameterization phenomenon in value-based deep RL methods that use bootstrapping: when value functions,
approximated using deep neural networks, are trained with gradient descent using iterated regression onto target values generated by
previous instances of the value network, more gradient updates decrease the expressivity of the current value network.
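The loss of expressivity is measured via the effective rank of the learned feature matrix; a thresholded singular-value variant of that measure can be sketched as follows (an illustrative computation, with `delta` as an assumed threshold).

```python
import numpy as np

def effective_rank(features, delta=0.01):
    """Smallest k such that the top-k singular values of the (batch, dim)
    feature matrix capture a (1 - delta) fraction of the total singular-value
    mass. A shrinking effective rank signals feature-rank collapse under
    repeated bootstrapped regression."""
    sv = np.linalg.svd(np.asarray(features), compute_uv=False)
    cum = np.cumsum(sv) / sv.sum()
    return int(np.searchsorted(cum, 1.0 - delta) + 1)
```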
RL Unplugged: Benchmarks for Offline Reinforcement Learning
Caglar Gulcehre, Ziyu Wang, Alexander Novikov, Tom Le Paine, Sergio Gómez Colmenarejo, Konrad Zolna, Rishabh Agarwal,
Josh Merel, Daniel Mankowitz, Cosmin Paduraru, Gabriel Dulac-Arnold, Jerry Li, Mohammad Norouzi, Matt Hoffman, Ofir Nachum,
George Tucker, Nicolas Heess, Nando de Freitas
NeurIPS 2020
We propose a benchmark called RL Unplugged to evaluate and compare offline RL methods across a diverse range of domains. We provide detailed evaluation
protocols for each domain and an extensive analysis of existing methods using these protocols. We hope that our suite of benchmarks will
increase reproducibility in offline RL and make it possible to study challenging tasks with a limited computational budget, thus making RL research
both more systematic and more accessible across the community.
An Optimistic Perspective on Offline Reinforcement Learning
Rishabh Agarwal, Dale Schuurmans, Mohammad Norouzi
ICML 2020 (Talk)
This paper popularized offline RL and showed that standard off-policy algorithms perform quite well in the fully
off-policy / offline deep RL setting with large and diverse datasets. A previous version, titled "Striving for Simplicity in Off-Policy Deep Reinforcement Learning",
was presented as a contributed talk at the NeurIPS 2019 Deep RL Workshop.