Jump to preprints or selected publications.
Preprints
Generative Verifiers: Reward Modeling as Next-Token Prediction
Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, Rishabh Agarwal
Training generative reward models (GenRM) with next-token prediction, jointly on verification and solution generation. Such generative verifiers can use chain-of-thought (CoT) reasoning and additional test-time compute via majority voting for better verification. Generative CoT verifiers trained on GSM8K generalize to much harder problems in MATH!
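A minimal sketch of this verification recipe using Hugging Face Transformers; the checkpoint name, prompt format, and sampling settings below are placeholders/assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "my-org/genrm-verifier"  # placeholder checkpoint, not a released model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

def p_yes(prompt: str) -> float:
    """Probability that the verifier's next token is 'Yes' (its correctness score)."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_token_logits = model(ids).logits[0, -1]
    probs = torch.softmax(next_token_logits, dim=-1)
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]  # first token of " Yes"
    return probs[yes_id].item()

def genrm_score(question: str, solution: str, k: int = 8) -> float:
    """Spend extra test-time compute: average P('Yes') over k sampled CoT verifications."""
    base = f"Question: {question}\nSolution: {solution}\nLet's verify step by step."
    scores = []
    for _ in range(k):
        ids = tok(base, return_tensors="pt").input_ids
        out = model.generate(ids, do_sample=True, temperature=0.7, max_new_tokens=256)
        rationale = tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
        scores.append(p_yes(base + rationale + "\nIs the solution correct? Answer:"))
    return sum(scores) / len(scores)
```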
Not All LLM Reasoners are Created Equal
Arian Hosseini, Alessandro Sordoni, Daniel Toyama, Aaron Courville, Rishabh Agarwal
LLMs, especially smaller and cost-efficient LLMs, exhibit systematic differences in their reasoning abilities, despite what their performance on standard math benchmarks indicates.
Selected Publications
Many-Shot In-Context Learning
Rishabh Agarwal*, Avi Singh*, Lei M. Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan et al.
NeurIPS (Spotlight), Oral@ICML Long-Context Workshop
Explores in-context learning (ICL) with hundreds or thousands of examples. Unlike few-shot learning, many-shot learning is effective at overriding pretraining biases, can learn high-dimensional functions with numerical inputs, and performs comparably to fine-tuning.
V-STaR: Training Verifiers for Self-Taught Reasoners
Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, Rishabh Agarwal
CoLM 2024
Self-improvement approaches, such as ReST^EM and STaR, discard all the LLM-generated incorrect solutions during training. V-STaR augments such approaches by training a verifier on both correct and incorrect solutions, which is then used at test time to re-rank LLM generations.
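A minimal sketch of the data-collection and re-ranking logic, assuming a generator, answer extractor, and trained verifier are available as callables (the paper trains the verifier with DPO):

```python
from typing import Callable, List, Tuple

def collect_verifier_data(
    problems: List[Tuple[str, str]],                 # (question, gold_answer) pairs
    generate: Callable[[str, int], List[str]],       # assumed: samples k solutions per question
    extract_answer: Callable[[str], str],            # assumed: pulls the final answer from a solution
    k: int = 16,
) -> List[Tuple[str, str, int]]:
    """Keep *both* correct and incorrect generations, labeled by an answer check.

    STaR/ReST^EM would keep only the correct ones; V-STaR also uses the incorrect
    ones as training signal for the verifier.
    """
    data = []
    for question, gold in problems:
        for sol in generate(question, k):
            label = int(extract_answer(sol) == gold)
            data.append((question, sol, label))
    return data

def best_of_n(
    question: str,
    generate: Callable[[str, int], List[str]],
    verifier_score: Callable[[str, str], float],     # assumed: higher = more likely correct
    n: int = 64,
) -> str:
    """Test-time re-ranking: return the candidate the verifier scores highest."""
    candidates = generate(question, n)
    return max(candidates, key=lambda s: verifier_score(question, s))
```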
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
Avi Singh*, JD Co-Reyes*, Rishabh Agarwal* et al.
TMLR 2024
We explore whether we can go beyond human data on tasks where we have access to scalar feedback, finding that
a simple self-training method based on expectation-maximization can substantially reduce dependence on human-generated data.
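A rough sketch of the expectation-maximization loop, with the sampler, reward function, and fine-tuning routine left as assumed callables:

```python
from typing import Any, Callable, List, Tuple

def rest_em(
    base_model: Any,                                   # pretrained checkpoint
    problems: List[str],
    sample: Callable[[Any, str, int], List[str]],      # assumed: draws k solutions from a model
    reward: Callable[[str, str], float],               # assumed: scalar feedback, e.g. 1.0 if the answer checks out
    finetune: Callable[[Any, List[Tuple[str, str]]], Any],  # assumed: SFT on (problem, solution) pairs
    iterations: int = 3,
    k: int = 32,
    threshold: float = 0.5,
) -> Any:
    """Minimal sketch of the EM-style self-training loop.

    E-step: sample solutions from the current model and keep those with high reward.
    M-step: fine-tune from the base model on the filtered, self-generated data.
    """
    model = base_model
    for _ in range(iterations):
        # E-step: generate solutions and filter with scalar feedback
        filtered = [
            (p, s)
            for p in problems
            for s in sample(model, p, k)
            if reward(p, s) > threshold
        ]
        # M-step: supervised fine-tuning on the filtered data (restarting from the base model)
        model = finetune(base_model, filtered)
    return model
```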
On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes
Rishabh Agarwal*, Nino Vieillard*, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, Olivier Bachem
ICLR 2024
GKD tackles the train-inference distribution mismatch in distilling autoregressive models, and outperforms commonly used approaches for distilling LLMs on summarization, translation, and reasoning tasks. Used for post-training distillation of Gemma 2.
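A minimal PyTorch sketch of one divergence choice (reverse KL) computed on student-generated tokens; GKD also considers forward KL and a generalized Jensen-Shannon divergence.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor) -> torch.Tensor:
    """Reverse KL, KL(student || teacher), averaged over tokens.

    Both logit tensors ([batch, seq, vocab]) are computed on sequences *sampled from
    the student*, so the student receives feedback on its own mistakes instead of only
    on fixed teacher/ground-truth outputs (the source of the distribution mismatch).
    """
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    kl_per_token = (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)
    return kl_per_token.mean()
```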
Bigger, Better, Faster: Human-level Atari with human-level efficiency
Max Schwarzer, Johan Obando-Ceron, Aaron Courville, Marc Bellemare, Pablo Samuel Castro*, Rishabh Agarwal*
ICML 2023
By scaling compute and model size along with appropriate design choices, value-based methods achieve super-human performance on the Atari 100K benchmark while being 4x more compute-efficient than the prior state of the art.
Offline Q-Learning on Diverse Multi-Task Data Both Scales And Generalizes
Aviral Kumar, Rishabh Agarwal, Xinyang Geng, George Tucker, Sergey Levine
ICLR 2023 (Top 5%), NeurIPS 2022 DRL Workshop Best Paper Runner-up
With appropriate design choices, offline Q-learning exhibits strong performance that scales with model capacity. The secret ingredients were training on large and diverse offline datasets with ResNets, distributional C51 backups, and feature normalization (that is, making RL training look more like supervised learning).
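For illustration, a sketch of the feature-normalization ingredient in front of a C51-style distributional head; the dimensions and the (omitted) ResNet torso are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedQHead(nn.Module):
    """L2-normalize the penultimate features before the final distributional layer,
    which makes the bootstrapped regression behave more like supervised learning."""

    def __init__(self, feature_dim: int = 512, num_actions: int = 18, num_atoms: int = 51):
        super().__init__()
        self.head = nn.Linear(feature_dim, num_actions * num_atoms)
        self.num_actions, self.num_atoms = num_actions, num_atoms

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        features = F.normalize(features, dim=-1)          # feature normalization
        logits = self.head(features).view(-1, self.num_actions, self.num_atoms)
        return F.log_softmax(logits, dim=-1)              # per-action return distributions (C51-style)
```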
Beyond Tabula Rasa: Reincarnating Reinforcement Learning
Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, Marc G. Bellemare
NeurIPS 2022
This work proposes an alternative research workflow to tabula rasa RL, in which prior computational work (e.g., learned policies) is transferred from one agent to another.
Deep RL at the Edge of the Statistical Precipice
Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, Marc G. Bellemare
NeurIPS 2021 (Outstanding Paper Award)
Our findings call for a change in how we evaluate performance on deep RL benchmarks, for which we present more reliable protocols and an open-source library, easily applicable with *even a handful of runs*, to prevent unreliable results from stagnating the field.
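For example, with the open-source rliable library, aggregate scores can be reported as interquartile means with stratified-bootstrap confidence intervals (the score arrays below are random placeholders):

```python
import numpy as np
from rliable import library as rly, metrics

# algorithm -> array of shape [num_runs, num_tasks] of normalized scores (placeholder data)
scores = {"MyAgent": np.random.rand(5, 26), "Baseline": np.random.rand(5, 26)}

# Interquartile mean (IQM) with stratified-bootstrap confidence intervals,
# usable even with a handful of runs per task.
iqm = lambda x: np.array([metrics.aggregate_iqm(x)])
point_estimates, interval_estimates = rly.get_interval_estimates(scores, iqm, reps=2000)
print(point_estimates, interval_estimates)
```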
Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning
Rishabh Agarwal, Marlos C. Machado, Pablo Samuel Castro, Marc G. Bellemare
ICLR 2021 (Spotlight)
To improve generalization, we learn representations, via a contrastive loss, that place states with similar long-term optimal behavior close together. This is orthogonal to existing approaches such as data augmentation. An earlier version was accepted as an oral presentation at the NeurIPS 2020 Workshop on Biological and Artificial RL.
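A generic sketch (not the paper's exact objective) of a contrastive loss that treats behaviorally similar states as soft positives; the behavior-similarity matrix is assumed to be precomputed from a behavioral metric.

```python
import torch
import torch.nn.functional as F

def behavioral_contrastive_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                                behavior_sim: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """z_a, z_b: [n, d] embeddings of two batches of states (e.g. paired environments).
    behavior_sim: [n, n] float matrix of long-term behavioral similarity (assumed given).
    States with similar optimal behavior act as soft positives, pulling embeddings together."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                              # [n, n] cosine similarities
    targets = behavior_sim / behavior_sim.sum(dim=1, keepdim=True)    # soft labels per row
    return F.cross_entropy(logits, targets)
```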
Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning
Aviral Kumar*, Rishabh Agarwal*, Dibya Ghosh, Sergey Levine
ICLR 2021
We identify an implicit under-parameterization phenomenon in value-based deep RL methods that use bootstrapping: when value functions,
approximated using deep neural networks, are trained with gradient descent using iterated regression onto target values generated by
previous instances of the value network, more gradient updates decrease the expressivity of the current value network.
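The phenomenon is tracked via the effective rank of the learned feature matrix; a small sketch of that quantity:

```python
import numpy as np

def effective_rank(features: np.ndarray, delta: float = 0.01) -> int:
    """Effective rank srank_delta of a feature matrix Phi (rows = states).

    srank is the smallest k such that the top-k singular values capture a
    (1 - delta) fraction of the total singular value mass; the paper finds this
    shrinks as bootstrapped gradient updates accumulate.
    """
    singular_values = np.linalg.svd(features, compute_uv=False)
    cumulative = np.cumsum(singular_values) / np.sum(singular_values)
    return int(np.searchsorted(cumulative, 1.0 - delta) + 1)

# Example: features of 512 states from a 256-dim penultimate layer (random placeholder).
phi = np.random.randn(512, 256)
print(effective_rank(phi))
```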
RL Unplugged: Benchmarks for Offline Reinforcement Learning
Caglar Gulcehre, Ziyu Wang, Alexander Novikov, Tom Le Paine, Sergio Gómez Colmenarejo, Konrad Zolna, Rishabh Agarwal,
Josh Merel, Daniel Mankowitz, Cosmin Paduraru, Gabriel Dulac-Arnold, Jerry Li, Mohammad Norouzi, Matt Hoffman, Ofir Nachum,
George Tucker, Nicolas Heess, Nando de Freitas
NeurIPS 2020
We propose a benchmark called RL Unplugged to evaluate and compare offline RL methods across a diverse range of domains. We provide detailed evaluation protocols for each domain and an extensive analysis of existing methods using these protocols. We hope that our suite of benchmarks will increase reproducibility in offline RL and make it possible to study challenging tasks with a limited computational budget, making RL research both more systematic and more accessible across the community.
An Optimistic Perspective on Offline Reinforcement Learning
Rishabh Agarwal, Dale Schuurmans, Mohammad Norouzi
ICML 2020 (Talk)
This paper popularized offline RL and showed that standard off-policy algorithms perform quite well in the fully off-policy / offline deep RL setting with large and diverse datasets. A previous version, titled "Striving for Simplicity in Off-Policy Deep Reinforcement Learning", was presented as a contributed talk at the NeurIPS 2019 DRL Workshop.