Welcome, dear readers! In this month's iteration, we focus on the intriguing paper "Efficient Exploration for LLMs", a hot topic in the universe of Machine Learning (ML). The wizards on the Google DeepMind team employ several interesting techniques that allow ML models to learn much faster with far fewer queries. It's like discovering a secret shortcut in a maze: suddenly, reaching high performance is not just a goal but a swift reality, especially at a time when Reinforcement Learning from Human Feedback (RLHF) is the bottleneck slowing the development of ML models.

Efficient exploration, as our guiding beacon, takes two primary forms according to the paper: passive and active exploration. In the passive realm, we stick to the traditional Reinforcement Learning from Human Feedback (RLHF) process, where a reward model is fit to human ratings of responses one query at a time, with no say in which responses get shown. On the flip side, active exploration involves tailoring interactions to elicit useful feedback. In simpler terms, the model actively chooses which responses to ask about, so the feedback it receives teaches it more about what works well and what doesn't.
Before moving deeper into the technical details, enter the key uncertainty-estimation terms: infomax and double Thompson sampling (double TS), strategies that sound like a pair of secret agents working tirelessly behind the scenes to optimize the learning process. The "Thompson" part refers to Thompson sampling, while the "double" part means the pair of responses shown to the rater is chosen by drawing two samples from the posterior over reward models and taking the best response under each, making the comparison more informative. The use of double TS in active exploration proves to be the star result of this paper, requiring far fewer queries to reach peak performance. Uncertainty estimated with the infomax method, which selects the pair of responses that maximizes information gain, also helps LLMs learn better than purely reward-based active exploration.
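To make the mechanics concrete, here is a minimal Python sketch of double TS over a set of candidate responses, assuming the posterior is represented by an ensemble of reward models; names like `reward_ensemble` and `candidate_features` are illustrative, not taken from the paper.

```python
import numpy as np

def double_thompson_sample(candidate_features, reward_ensemble, rng):
    """Pick two responses to show the rater, each chosen greedily under an
    independent posterior sample (here: a randomly drawn ensemble member).

    candidate_features: list of feature vectors, one per candidate response
    reward_ensemble: list of callables, each mapping features -> scalar reward
    rng: a numpy Generator, e.g. np.random.default_rng(0)
    """
    # First Thompson sample: draw one ensemble member and take its argmax.
    first_model = reward_ensemble[rng.integers(len(reward_ensemble))]
    first_idx = int(np.argmax([first_model(x) for x in candidate_features]))

    # Second Thompson sample: redraw until its argmax differs from the first
    # pick, so the rater always compares two distinct responses.
    for _ in range(100):  # retry budget to avoid an endless loop
        second_model = reward_ensemble[rng.integers(len(reward_ensemble))]
        second_idx = int(np.argmax([second_model(x) for x in candidate_features]))
        if second_idx != first_idx:
            return first_idx, second_idx

    # Fallback (a sketch choice): runner-up under the first sampled model.
    scores = [first_model(x) for x in candidate_features]
    return first_idx, int(np.argsort(scores)[-2])
```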
Architecture and Setup
Picture this: a symphony of data orchestrated by Google DeepMind, featuring prompts from the Anthropic Helpfulness Base dataset and starring the Gemini Nano and Pro models (or you can simply look at the figure below). The learning pipeline unfolds with the creation of learned reward models, each powering a different agent in this grand play. Model 1, a plain MLP, assigns point-estimate rewards to prompt-response pairs and picks responses with Boltzmann exploration. Meanwhile, Model 2, an Epistemic Neural Network (ENN), predicts rewards that vary with an epistemic index, providing the uncertainty estimates needed by either infomax or double TS.
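One way to picture Model 2 is as an ensemble standing in for the ENN, where the epistemic index simply selects which head to query. The PyTorch sketch below is my own simplification of that idea, with the hidden sizes borrowed from the results section further down.

```python
import torch
import torch.nn as nn

class EnsembleRewardModel(nn.Module):
    """Epistemic reward model sketched as an ensemble of small MLP heads.

    Each head z (the "epistemic index") maps an embedding of a prompt-response
    pair to a scalar reward; disagreement across heads is the uncertainty
    signal that infomax and double TS exploit.
    """

    def __init__(self, embed_dim: int, num_heads: int = 10, hidden: int = 128):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(embed_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            for _ in range(num_heads)
        ])

    def forward(self, embedding: torch.Tensor, z: int) -> torch.Tensor:
        # Reward estimate for a prompt-response embedding under epistemic index z.
        return self.heads[z](embedding).squeeze(-1)

    def all_heads(self, embedding: torch.Tensor) -> torch.Tensor:
        # Shape (num_heads, batch); the spread across heads approximates
        # the model's epistemic uncertainty about the reward.
        return torch.stack([h(embedding).squeeze(-1) for h in self.heads])
```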
For every query (a prompt with two responses), the feedback is governed by the Bradley-Terry choice model, a statistical model for paired-comparison experiments in which the better of two items is chosen with a probability determined by their relative underlying strengths, as measured by win-loss outcomes. Each agent outputs the two responses it most wants feedback on, and the preference it receives is then used to update the associated reward model, adjusting the preference probabilities it predicts.
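For intuition, the Bradley-Terry probability that response A beats response B is just a logistic function of the reward gap. The tiny helper below (my own naming) is the kind of expression used both to simulate feedback and inside a reward model's training loss.

```python
import math

def bradley_terry_prob(reward_a: float, reward_b: float) -> float:
    """Probability that response A is preferred over response B under the
    Bradley-Terry model, with rewards playing the role of log-strengths."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

# Example: a response scored 1.2 beats one scored 0.4 about 69% of the time.
print(bradley_terry_prob(1.2, 0.4))  # ~0.69
```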
The assessment pipeline, akin to a critical review, compares the performance of each agent against a baseline model (the plain LLM with no exploration-driven tuning). It's a showdown of win rates, a spectacle where the active exploration methods steal the spotlight and double TS emerges as the undisputed champion. The win rate is simply how often an agent's response is preferred over the baseline's response by the raters.
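Concretely, such a win rate could be tallied with something like the sketch below, where `judge_prefers` stands in for whoever (or whatever) supplies the preference judgment; it is an assumed placeholder, not an interface from the paper.

```python
def win_rate(agent_responses, baseline_responses, judge_prefers):
    """Fraction of held-out prompts on which the judge prefers the agent's
    response; judge_prefers(a, b) returns True when response a wins."""
    wins = sum(judge_prefers(a, b)
               for a, b in zip(agent_responses, baseline_responses))
    return wins / len(agent_responses)
```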
Results
As the curtain rises on the results, a batch of 32 prompts per feedback round, N = 100 candidate responses per prompt, and an evaluation over 2048 held-out prompts reveal the true power of efficient exploration. On the reward-model side, Model 1 showcases its skills with an MLP of 2 hidden layers of 128 neurons each, while Model 2 dazzles with an ensemble of S = 10 MLPs for double TS and S = 30 MLPs for the infomax agent. At 30,000 queries, passive exploration, serving as the baseline, boasts a modest win rate of roughly 63%. All active exploration methods outshine it, with double TS leading the pack at nearly 67%. Not only does double TS triumph, it reaches the baseline's level of performance with fewer than half the queries, proving its efficiency beyond doubt.
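To keep the hyperparameters above in one place, here is a small config sketch restating them; the field names are mine, not the paper's.

```python
from dataclasses import dataclass

@dataclass
class ExperimentConfig:
    # Query generation and evaluation
    prompts_per_batch: int = 32       # prompts queried per feedback round
    candidates_per_prompt: int = 100  # N responses sampled per prompt
    eval_prompts: int = 2048          # held-out prompts for the win-rate comparison

    # Reward-model architecture
    mlp_hidden_layers: int = 2
    mlp_hidden_units: int = 128
    ensemble_size_double_ts: int = 10
    ensemble_size_infomax: int = 30
```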
The future beckons us to a landscape where faster release iterations for AI models are the norm. As we embrace more efficient RLHF training, with its promise of faster convergence, we march towards a horizon where AI models evolve at an unprecedented pace, aligning more closely with human values and setting new standards for improvement. If other organizations prove it works for them too, the secrets of efficient exploration, unlocked by Google DeepMind, will be not just a chapter in a book but a catalyst for the age of Machine Learning.