Here's my Google Scholar. Here's how one should do research according to Richard Hamming.
Jump to preprints or selected publications.

Preprints

Generative Verifiers: Reward Modeling as Next-Token Prediction

Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, Rishabh Agarwal

Training generative reward models (GenRM) with next-token prediction, jointly on verification and solution generation. Such generative verifiers can use chain-of-thought (CoT) reasoning and additional test-time compute via majority voting for better verification. Generative CoT verifiers trained on GSM generalize to much harder problems in MATH!
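
A minimal sketch of the test-time recipe, assuming hypothetical stand-ins for calls to the trained verifier (sample_verification_cot and p_yes_given_cot are illustrative names, not the paper's API): verification scores are averaged over sampled CoT rationales, and candidate solutions are re-ranked by that score.

```python
import random
from statistics import mean

# Hypothetical stand-ins for calls to the GenRM checkpoint; the names and
# placeholder bodies are assumptions, not the paper's API.
def sample_verification_cot(question: str, solution: str) -> str:
    """Sample a chain-of-thought verification rationale for (question, solution)."""
    return "placeholder rationale"

def p_yes_given_cot(question: str, solution: str, cot: str) -> float:
    """Probability the verifier assigns to the 'Yes' token after the rationale."""
    return random.random()  # placeholder

def genrm_cot_score(question: str, solution: str, num_votes: int = 32) -> float:
    # Majority voting: average the 'Yes' probability over independently sampled
    # rationales, so verification improves with additional test-time compute.
    return mean(
        p_yes_given_cot(question, solution, sample_verification_cot(question, solution))
        for _ in range(num_votes)
    )

def best_of_n(question: str, candidates: list[str]) -> str:
    # Pick the candidate solution the generative verifier scores highest.
    return max(candidates, key=lambda s: genrm_cot_score(question, s))
```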

Not All LLM Reasoners are Created Equal

Arian Hosseini, Alessandro Sordoni, Daniel Toyama, Aaron Courville, Rishabh Agarwal

LLMs, especially smaller and more cost-efficient ones, exhibit systematic differences in their reasoning abilities, despite what their performance on standard math benchmarks suggests.

Selected Publications

Many-Shot In-Context Learning

Rishabh Agarwal*, Avi Singh*, Lei M. Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan et al.
NeurIPS (Spotlight), Oral@ICML Long-Context Workshop

Explores ICL with hundreds or thousands of examples. Unlike few-shot learning, many-shot learning is effective at overriding pretraining biases, can learn high-dimensional functions with numerical inputs, and performs comparably to fine-tuning.
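
As a rough illustration of the setup (the "Q:/A:" template below is an assumption, not the paper's prompt format), many-shot ICL simply packs far more input-output examples into a long-context model's prompt:

```python
# Illustrative prompt construction for few-shot vs. many-shot ICL.
def build_prompt(examples: list[tuple[str, str]], query: str, num_shots: int) -> str:
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples[:num_shots])
    return f"{shots}\n\nQ: {query}\nA:"

# few_shot  = build_prompt(train_examples, question, num_shots=5)
# many_shot = build_prompt(train_examples, question, num_shots=2048)
```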

V-STaR: Training Verifiers for Self-Taught Reasoners

Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, Rishabh Agarwal
CoLM 2024

Self-improvement approaches such as ReST^EM and STaR discard all incorrect LLM-generated solutions during training. V-STaR augments such approaches by training a verifier on both correct and incorrect solutions, which is then used at test time to re-rank LLM generations.
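
A minimal sketch of the data-collection step, with illustrative helper names (generator, is_correct) rather than the paper's code: correct samples fine-tune the generator as in STaR, while correct/incorrect pairs train the verifier.

```python
def vstar_collect(problems, generator, is_correct, k: int = 16):
    """Collect generator fine-tuning data and verifier preference pairs."""
    sft_data, verifier_pairs = [], []
    for problem in problems:
        samples = [generator(problem) for _ in range(k)]
        correct = [s for s in samples if is_correct(problem, s)]
        incorrect = [s for s in samples if not is_correct(problem, s)]
        # STaR / ReST^EM keep only `correct`; V-STaR also exploits `incorrect`.
        sft_data += [(problem, s) for s in correct]
        verifier_pairs += [(problem, c, w) for c in correct for w in incorrect]
    return sft_data, verifier_pairs
```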

Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

Avi Singh*, John D. Co-Reyes*, Rishabh Agarwal* et al.
TMLR 2024

We explore whether we can go beyond human data on tasks where we have access to scalar feedback, finding that a simple self-training method based on expectation-maximization can substantially reduce dependence on human-generated data.
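
A minimal sketch of the loop, assuming a binary scalar reward and illustrative helpers (generate, reward, finetune): sample solutions (E-step), keep the ones that receive positive reward, and fine-tune on them (M-step).

```python
def rest_em(base_model, problems, generate, reward, finetune,
            num_iterations: int = 3, samples_per_problem: int = 32):
    model = base_model
    for _ in range(num_iterations):
        # E-step: sample solutions from the current model, keep the rewarded ones.
        dataset = [
            (problem, solution)
            for problem in problems
            for solution in (generate(model, problem) for _ in range(samples_per_problem))
            if reward(problem, solution) > 0
        ]
        # M-step: fine-tune from the base model on the filtered data
        # (restarting from the base model each iteration limits drift).
        model = finetune(base_model, dataset)
    return model
```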

On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes

Rishabh Agarwal*, Nino Vieillard*, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, Olivier Bachem
ICLR 2024

GKD (Generalized Knowledge Distillation) tackles the train-test distribution mismatch in distilling autoregressive models, and outperforms commonly used approaches for distilling LLMs on summarization, translation, and reasoning tasks. Used for post-training distillation of Gemma 2.
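
A minimal sketch of the distillation loss, assuming per-token logits for the same student-sampled sequence (the tensor shapes and beta convention are assumptions, not the paper's code): a generalized Jensen-Shannon divergence evaluated on the student's own generations, which is what removes the distribution mismatch.

```python
import math

import torch
import torch.nn.functional as F

def generalized_jsd_loss(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor,
                         beta: float = 0.5) -> torch.Tensor:
    """Per-token loss on a sequence sampled from the *student* (the on-policy part).
    Both logit tensors have shape [batch, seq_len, vocab]; 0 < beta < 1."""
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    # Log of the mixture m = beta * p_student + (1 - beta) * p_teacher.
    mix_logp = torch.logsumexp(
        torch.stack([s_logp + math.log(beta), t_logp + math.log(1.0 - beta)]), dim=0)
    # F.kl_div(log m, log p, log_target=True) computes KL(p || m).
    kl_s = F.kl_div(mix_logp, s_logp, log_target=True, reduction="batchmean")
    kl_t = F.kl_div(mix_logp, t_logp, log_target=True, reduction="batchmean")
    return beta * kl_s + (1.0 - beta) * kl_t
```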

Bigger, Better, Faster: Human-level Atari with human-level efficiency

Max Schwarzer, Johan Obando-Ceron, Aaron Courville, Marc Bellemare, Pablo Samuel Castro*, Rishabh Agarwal*
ICML 2023

By scaling compute and model size along with appropriate design choices, value-based methods achieve super-human performance on the Atari 100K benchmark, while being 4x more compute-efficient than the prior SOTA.

Offline Q-Learning on Diverse Multi-Task Data Both Scales And Generalizes

Aviral Kumar, Rishabh Agarwal, Xinyang Geng, George Tucker, Sergey Levine
ICLR 2023 (Top 5%), NeurIPS 2022 DRL Workshop Best Paper Runner-up

With appropriate design choices, offline Q-learning exhibits strong performance that scales with model capacity. The secret ingredients were training on a large and diverse offline dataset with ResNets, distributional C51 backups, and feature normalization (that is, making RL training look more like supervised learning).
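
For example, the feature-normalization ingredient amounts to normalizing the penultimate-layer features before the linear Q-head; a rough sketch (the epsilon and exact placement are assumptions, not the paper's implementation):

```python
import torch

def normalize_features(features: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # L2-normalize the encoder output so Q-value regression operates on a
    # bounded feature space, closer to a supervised-learning setup.
    return features / (features.norm(dim=-1, keepdim=True) + eps)
```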

Beyond Tabula Rasa: Reincarnating Reinforcement Learning

Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, Marc G. Bellemare
NeurIPS 2022

This work proposes an alternative research workflow to tabula rasa RL, in which prior computational work (e.g., learned policies) is transferred from one agent to another.
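
As a rough illustration of one such transfer (a sketch in the spirit of the paper's policy-to-value reincarnation baseline, not its implementation; the names and annealing schedule are assumptions), a teacher policy can kickstart a new agent via a decaying distillation term added to the usual RL loss:

```python
import torch
import torch.nn.functional as F

def reincarnation_loss(rl_loss: torch.Tensor,
                       student_logits: torch.Tensor,
                       teacher_probs: torch.Tensor,
                       step: int,
                       decay_steps: int = 100_000) -> torch.Tensor:
    # Cross-entropy to the teacher's action distribution, annealed to zero so
    # the student can eventually improve beyond the transferred policy.
    distill = -(teacher_probs * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()
    weight = max(0.0, 1.0 - step / decay_steps)
    return rl_loss + weight * distill
```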

Deep RL at the Edge of the Statistical Precipice

Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, Marc G. Bellemare
NeurIPS 2021 (Outstanding Paper Award)

Our findings call for a change in how we evaluate performance on deep RL benchmarks, for which we present more reliable protocols and an open-source library, easily applicable with *even a handful of runs*, to prevent unreliable results from stagnating the field.
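
A simplified numpy sketch of one recommended protocol, the interquartile mean (IQM) with a bootstrap confidence interval; the open-source library additionally provides stratified bootstrapping, performance profiles, and probability of improvement.

```python
import numpy as np

def iqm(scores: np.ndarray) -> float:
    # Interquartile mean: drop the bottom and top 25% of run-task scores and
    # average the middle 50% (more robust than the mean, more statistically
    # efficient than the median).
    flat = np.sort(scores.reshape(-1))
    n = len(flat)
    return float(flat[n // 4: n - n // 4].mean())

def iqm_with_ci(scores: np.ndarray, reps: int = 2000, alpha: float = 0.05):
    """scores: [num_runs, num_tasks]; resamples runs with replacement."""
    rng = np.random.default_rng(0)
    num_runs = scores.shape[0]
    boot = [iqm(scores[rng.integers(0, num_runs, num_runs)]) for _ in range(reps)]
    low, high = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return iqm(scores), (float(low), float(high))
```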

Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning

Rishabh Agarwal, Marlos C. Machado, Pablo Samuel Castro, Marc G. Bellemare
ICLR 2021 (Spotlight)

To improve generalization, we learn representations, via a contrastive loss, that group together states with similar long-term optimal behavior. This is orthogonal to existing approaches such as data augmentation. An earlier version was accepted as an oral presentation at the NeurIPS 2020 Workshop on Biological and Artificial RL.
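
A rough sketch of the idea, not the paper's exact objective (the soft-target formulation and the given behavioral_similarity matrix are assumptions): pairs of state embeddings are pulled together in proportion to how similar their long-term optimal behavior is.

```python
import torch
import torch.nn.functional as F

def behavioral_contrastive_loss(z_x: torch.Tensor, z_y: torch.Tensor,
                                behavioral_similarity: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """z_x, z_y: [n, d] state embeddings from two environments;
    behavioral_similarity: [n, n] non-negative similarity of optimal behavior."""
    z_x = F.normalize(z_x, dim=-1)
    z_y = F.normalize(z_y, dim=-1)
    logits = z_x @ z_y.T / temperature                       # cosine similarities
    targets = behavioral_similarity / behavioral_similarity.sum(dim=1, keepdim=True)
    return F.cross_entropy(logits, targets)                  # soft targets
```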

Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning

Aviral Kumar*, Rishabh Agarwal*, Dibya Ghosh, Sergey Levine
ICLR 2021

We identify an implicit under-parameterization phenomenon in value-based deep RL methods that use bootstrapping: when value functions, approximated using deep neural networks, are trained with gradient descent using iterated regression onto target values generated by previous instances of the value network, more gradient updates decrease the expressivity of the current value network.
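
The expressivity measure used to track this phenomenon is the effective rank of the learned feature matrix; a minimal numpy sketch (the delta value follows the paper's convention, the rest is illustrative):

```python
import numpy as np

def srank(features: np.ndarray, delta: float = 0.01) -> int:
    """Effective rank of a feature matrix [num_states, feature_dim]: the smallest
    k whose top-k singular values capture a (1 - delta) fraction of the spectrum."""
    singular_values = np.linalg.svd(features, compute_uv=False)
    cumulative = np.cumsum(singular_values) / singular_values.sum()
    return int(np.searchsorted(cumulative, 1.0 - delta) + 1)
```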

RL Unplugged: Benchmarks for Offline Reinforcement Learning

Caglar Gulcehre, Ziyu Wang, Alexander Novikov, Tom Le Paine, Sergio Gómez Colmenarejo, Konrad Zolna, Rishabh Agarwal, Josh Merel, Daniel Mankowitz, Cosmin Paduraru, Gabriel Dulac-Arnold, Jerry Li, Mohammad Norouzi, Matt Hoffman, Ofir Nachum, George Tucker, Nicolas Heess, Nando de Freitas
NeurIPS 2020

We propose a benchmark called RL Unplugged to evaluate and compare offline RL methods on a diverse range of domains. We provide detailed evaluation protocols for each domain and provide an extensive analysis of existing methods using these protocols. We hope that our suite of benchmarks will increase the reproducibility in offline RL and make it possible to study challenging tasks with a limited computational budget, thus making RL research both more systematic and more accessible across the community.

An Optimistic Perspective on Offline Reinforcement Learning

Rishabh Agarwal, Dale Schuurmans, Mohammad Norouzi
ICML 2020 (Talk)

This paper popularized offline RL and showed that standard off-policy algorithms perform quite well in the fully off-policy / offline deep RL setting with large and diverse datasets. A previous version was titled "Striving for Simplicity in Off-Policy Deep Reinforcement Learning" and presented as a contributed talk at NeurIPS 2019 DRL workshop.