
AdaReasoner

Adaptive Reasoning Enables More Flexible Thinking

Project Overview

AdaReasoner is a novel, plug-and-play framework designed to automatically configure reasoning strategies for Large Language Models (LLMs). Unlike traditional prompting methods that use static configurations, AdaReasoner dynamically selects optimal parameters (like temperature and number of reasoning steps), adapting to different tasks for improved performance and robustness.

Motivation

Why AdaReasoner?

Large language models have shown remarkable reasoning abilities, but their success often hinges on getting just the right settings—like how many steps to take, whether to think step-by-step, or how much randomness to allow. The problem? These settings are usually fixed, and what works for one task might completely fail on another. AdaReasoner was created to tackle this issue head-on: instead of relying on a one-size-fits-all approach, it learns to adapt the reasoning strategy to each specific question, making the model more flexible, reliable, and effective across a wide range of tasks.

Key Contributions

Main Contributions
AdaReasoner is an LLM-agnostic plugin that automates adaptive reasoning configurations for tasks requiring diverse types of thinking, implemented as a reinforcement learning framework with a factorized action space.
Its training is data-efficient yet scalable, requiring only a small number of samples per task, aided by a Boltzmann exploration mechanism (a minimal sketch follows this list).
Extensive evaluations on diverse tasks show that AdaReasoner outperforms standard Chain-of-Thought (CoT) and other baselines, and sustains strong out-of-distribution (OOD) performance.
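
To make the Boltzmann exploration mentioned above concrete, here is a minimal, self-contained sketch of softmax-based sampling over a small set of candidate configurations. The variable names and the four-configuration example are illustrative assumptions, not AdaReasoner's actual implementation.

```python
import numpy as np

def boltzmann_sample(q_values: np.ndarray, tau: float = 1.0) -> int:
    """Sample a configuration index with probability proportional to exp(Q / tau).

    Higher tau spreads probability (more exploration); lower tau concentrates
    it on the highest-value configuration (more exploitation).
    """
    logits = q_values / tau
    logits = logits - logits.max()      # subtract max for numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum()
    return int(np.random.choice(len(q_values), p=probs))

# Example: estimated rewards for four candidate reasoning configurations.
q = np.array([0.2, 0.5, 0.1, 0.4])
chosen = boltzmann_sample(q, tau=0.5)
```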

Core Idea

Adaptive Reasoning

AdaReasoner introduces adaptive reasoning through a model-agnostic, reinforcement learning (RL)-based policy system. Its main innovations include:

Factorized Action Space: Separates reasoning parameters (e.g., temperature, steps) to simplify and speed up policy learning.
Targeted Exploration Strategy: Guides exploration towards the most influential configurations.
Reward Model: Trained to predict the quality of reasoning outcomes, enabling efficient training with only a few examples.

Figure: Overview of AdaReasoner's architecture and workflow
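
A minimal sketch of the factorized-action-space idea described above: each reasoning dimension is sampled from its own small candidate set, so the policy learns one preference vector per factor rather than one per joint combination. The candidate values and field names below are assumptions for illustration; the exact dimensions used by AdaReasoner are described in the paper.

```python
import random
from dataclasses import dataclass

# Illustrative candidate sets; AdaReasoner's actual factors/values may differ.
TEMPERATURES = [0.0, 0.3, 0.7, 1.0]
NUM_STEPS = [2, 4, 6, 8]
INSTRUCTIONS = ["think step by step", "answer concisely", "consider alternatives"]

@dataclass
class ReasoningConfig:
    temperature: float
    num_steps: int
    instruction: str

def sample_config(policy_probs: dict[str, list[float]]) -> ReasoningConfig:
    """Sample one value per factor from its own categorical distribution.

    Because each factor has its own head, the policy only maintains
    len(TEMPERATURES) + len(NUM_STEPS) + len(INSTRUCTIONS) preferences
    instead of one preference per joint combination.
    """
    t = random.choices(TEMPERATURES, weights=policy_probs["temperature"])[0]
    s = random.choices(NUM_STEPS, weights=policy_probs["num_steps"])[0]
    i = random.choices(INSTRUCTIONS, weights=policy_probs["instruction"])[0]
    return ReasoningConfig(t, s, i)

# Uniform (untrained) policy over each factor.
uniform = {
    "temperature": [1.0] * len(TEMPERATURES),
    "num_steps": [1.0] * len(NUM_STEPS),
    "instruction": [1.0] * len(INSTRUCTIONS),
}
config = sample_config(uniform)
```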

Technical Highlights

RL Framework

Learns to select the best configuration for each prompt using reward signals from a pre-trained model.
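
One way such reward-driven selection could be trained is a REINFORCE-style update of softmax preferences, sketched below. This is an illustrative stand-in rather than AdaReasoner's exact training procedure, and the `reward` argument is assumed to be the score produced by the pretrained reward model.

```python
import numpy as np

def reinforce_update(prefs: np.ndarray, action: int, reward: float,
                     baseline: float, lr: float = 0.1) -> np.ndarray:
    """One policy-gradient step on softmax preferences over configurations.

    If the reward-model score for the chosen configuration beats the running
    baseline, its preference (and hence probability) increases; otherwise it
    decreases.
    """
    shifted = prefs - prefs.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0          # gradient of log softmax at the chosen action
    return prefs + lr * (reward - baseline) * grad_log_pi
```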

Sample Efficiency

Achieves strong performance with minimal supervision and data.

Compatibility

Can be plugged into any LLM without modification.
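
A minimal sketch of this plug-and-play usage: an adapter picks a configuration per question and forwards it as a prompt prefix plus decoding parameters. All names here (`adaptive_answer`, `llm_call`, `select_config`) are hypothetical, and `ReasoningConfig` refers to the dataclass from the earlier factorized-action-space sketch; this is not AdaReasoner's actual API.

```python
from typing import Callable

def adaptive_answer(question: str,
                    llm_call: Callable[[str, float], str],
                    select_config: Callable[[str], "ReasoningConfig"]) -> str:
    """Answer one question with a per-question reasoning configuration.

    `llm_call(prompt, temperature)` can be any backend (API or local model);
    the adapter only shapes the prompt and the decoding parameters, so the
    underlying LLM needs no modification.
    """
    cfg = select_config(question)        # e.g. the factorized policy sketched earlier
    prompt = (f"{cfg.instruction} Use about {cfg.num_steps} reasoning steps.\n\n"
              f"Question: {question}")
    return llm_call(prompt, cfg.temperature)
```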

Experimental Results

AdaReasoner was tested across six different LLMs and various reasoning tasks. Key findings include:

Improved Accuracy: Outperforms static prompting strategies on knowledge-intensive tasks.
Robustness: Maintains strong performance on out-of-distribution and few-shot scenarios.
Efficiency: Learns effective policies with limited computational resources.

šŸ† Main Results (GPT-4o)

Method            Metaphor   TruthfulQA   MMLU (Math)   LogiQA   Average
CoT               50.40      78.40        76.04         70.00    68.71
Think Short       61.00      64.81        68.52         70.81    66.28
ToT               48.25      74.29        86.11         73.90    70.91
Best-of-N         52.60      79.41        83.41         72.37    71.95
Auto-CoT          62.33      83.09        72.15         71.71    72.32
In-context CoT    53.98      77.04        83.63         80.04    74.42
AdaReasoner       71.56      81.30        86.49         82.31    80.42

Team & Acknowledgements

Authors
Xiangqi Wang¹, Yue Huang¹, Yanbo Wang², Xiaonan Luo¹, Kehan Guo¹, Yujun Zhou¹, Xiangliang Zhang¹

* Equal Contribution

1 University of Notre Dame
2 Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)

Corresponding Author: xzhang33@nd.edu (Xiangliang Zhang)
