🔭 Scope

| Research Question(s) | Stage 1: How do prompting-based agentic methods compare to fine-tuning agentic methods across a diverse set of benchmarks? Stage 2: How can we unite these two paradigms into a cohesive, novel, and performant fine-tuning agentic method? |
| --- | --- |
| Scope | Tags: AI, NLP, LLMs, LLM-based agent methods, prompting, fine-tuning |

Independent variables: the agentic method (prompting-based vs. fine-tuning-based), the benchmark, and the underlying model, as detailed in the sections below.

🧠 Methods

The prompting and fine-tuning methods listed below are tentative. For each method, I've included a link to its code repository (if one exists) and whether or not I'm planning to include it in the study.

✅ means I’ll include it, ⚠️ means I’m not sure, and 🚫 means I won’t include it (though it might still be useful to include its reported results).

| Prompting | Code? | Include? | Fine-tuning | Code? | Include? |
| --- | --- | --- | --- | --- | --- |
| ReAct |  |  | FireAct |  |  |
| Reflexion |  |  | AgentTuning |  |  |
| CRITIC |  |  | Agent-FLAN |  |  |
| Self-Refine |  |  | AgentOptimizer | https://github.com/microsoft/autogen/blob/main/notebook/agentchat_agentoptimizer.ipynb |  |
| Language Agent Tree Search (LATS) |  |  | REFINER |  |  |
| ExpeL |  |  | ReWOO |  | ⚠️ |
| PreAct |  | ⚠️ | LUMOS |  | ⚠️ |
|  |  |  | AUTOACT |  | ⚠️ |
| RoT |  | ⚠️ | PREFER |  | ⚠️ |
| LearnAct |  | ⚠️ | SwiftSage |  | ⚠️ |
| Mirror |  | ⚠️ | ETO |  | ⚠️ |
| More Agents Is All You Need (MAIAYN) | https://anonymous.4open.science/r/more_agent_is_all_you_need | ⚠️ | Negative-Aware Training (NAT) |  | ⚠️ |
| AnyTool |  | ⚠️ | α-UMi |  | ⚠️ |
| Self-Demos |  | ⚠️ | Gorilla |  | ⚠️ |
| Investigate-Consolidate-Exploit (ICE) | 🚫 | 🚫 | Toolformer |  | ⚠️ |
| Self-Convince | 🚫 | 🚫 | ToolLLM |  | ⚠️ |
| RankPrompt | 🚫 | 🚫 | ToRA |  | ⚠️ |
| BAGEL | 🚫 | 🚫 | KnowAgent |  | ⚠️ |
|  |  |  | AgentOhana |  | ⚠️ |
| ReST Meets ReAct | 🚫 | 🚫 |  |  |  |
| DUMA | 🚫 | 🚫 |  |  |  |
| CYCLE | 🚫 | 🚫 |  |  |  |
| TMBR | 🚫 | 🚫 |  |  |  |
| ART | 🚫 | 🚫 |  |  |  |
| AMOR | 🚫 | 🚫 |  |  |  |
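
To keep this triage easy to query while it's still in flux, the table could be mirrored in a small Python structure. The sketch below copies a handful of rows from the table above; the field names (`paradigm`, `include`, `code_url`) are illustrative assumptions of mine, not part of any method's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CandidateMethod:
    """One row of the triage table above (illustrative field names)."""
    name: str
    paradigm: str                   # "prompting" or "fine-tuning"
    include: Optional[str] = None   # "yes" (✅), "maybe" (⚠️), "no" (🚫), or None if unmarked
    code_url: Optional[str] = None  # linked repo, when the table lists one


# A few representative rows copied from the table; the real list would mirror every row.
CANDIDATES = [
    CandidateMethod("ReAct", "prompting"),
    CandidateMethod("PreAct", "prompting", include="maybe"),
    CandidateMethod("Investigate-Consolidate-Exploit (ICE)", "prompting", include="no"),
    CandidateMethod(
        "More Agents Is All You Need (MAIAYN)", "prompting", include="maybe",
        code_url="https://anonymous.4open.science/r/more_agent_is_all_you_need",
    ),
    CandidateMethod("ReWOO", "fine-tuning", include="maybe"),
    CandidateMethod(
        "AgentOptimizer", "fine-tuning",
        code_url="https://github.com/microsoft/autogen/blob/main/notebook/agentchat_agentoptimizer.ipynb",
    ),
]

# Example query: which prompting methods are still undecided?
undecided = [m.name for m in CANDIDATES if m.paradigm == "prompting" and m.include == "maybe"]
print(undecided)
```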

🧪 Benchmarks

✅ means I’ll include it, ⚠️ means I’m not sure, and 🚫 means I won’t include it (though it might still be useful to include its reported results).

An asterisk (*) means the benchmark is located in a repository and may not be a standard benchmark.

| Conversational/QA | Mathematical Reasoning | Code Generation | Decision-Making | Alignment | Commonsense Reasoning | Misc. |
| --- | --- | --- | --- | --- | --- | --- |
| HotpotQA ✅ | GSM8k ✅ | MBPP ✅ | ALFWorld ✅ | RealToxicityPrompts ⚠️ | CommonGen(-Hard) | AcronymGen* |
| FEVER ✅ | SVAMP ✅ | HumanEval ✅ | WebShop ✅ | HELM |  | ToolBench |
| TriviaQA ✅ | TabMWP ✅ | LeetcodeHardGym | AgentBench ✅ | Moral Stories |  | T-Eval |
| AmbigNQ ✅ | MATH ⚠️ | CodeNet | SciWorld |  |  | MINT |
| FED | MWPBench | PIE | MiniWoB++ |  |  |  |
| Sentiment Reversal* | GSM1k | SWEBench | WebArena |  |  |  |
| StrategyQA ⚠️ | ASDiv | InfiAgent-DABench |  |  |  |  |
| MMLU ⚠️ | GSM-Hard | HumanEval-XL |  |  |  |  |
| Bamboogle |  | DS-1000 |  |  |  |  |
| MT-Bench |  | MHPP |  |  |  |  |
| SOTUQA (ReWOO) |  |  |  |  |  |  |
| GAIA |  |  |  |  |  |  |
| BamTwoogle |  |  |  |  |  |  |
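
For the text-only ✅ benchmarks, a plausible loading route is the Hugging Face `datasets` library. The sketch below assumes the Hub IDs `hotpot_qa` and `gsm8k` (an assumption about where these sets live and how they load); agent environments like ALFWorld, WebShop, and AgentBench need their own harnesses from their respective repos and aren't covered here.

```python
# Minimal sketch, assuming these two ✅ benchmarks are pulled from the Hugging Face Hub.
from datasets import load_dataset

# HotpotQA: multi-hop QA; the "distractor" config mixes gold and distractor paragraphs.
hotpotqa = load_dataset("hotpot_qa", "distractor", split="validation")

# GSM8k: grade-school math word problems with worked solutions.
gsm8k = load_dataset("gsm8k", "main", split="test")

print(hotpotqa[0]["question"])
print(gsm8k[0]["question"])
```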

💪 Methods x Benchmarks

✅ means I’ll include it, ⚠️ means I’m not sure, and 🚫 means I won’t include it (though it might still be useful to include its reported results).

Only the ✅ methods are included, crossed with all of the benchmarks (a sketch of the resulting evaluation grid follows below).

An asterisk (*) means the benchmark is located in a repository and may not be a standard benchmark.

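As a rough sanity check on the size of stage 1, the grid is just the cross product of the finalized ✅ methods and ✅ benchmarks. The sketch below builds that grid in Python; the method names are placeholders (the ✅ method list above isn't final), and the benchmarks are the ones currently marked ✅.

```python
from itertools import product

# Placeholder lists: the methods here are illustrative stand-ins until the ✅ column
# in the methods table is finalized; the benchmarks are the ones currently marked ✅.
INCLUDED_METHODS = ["ReAct", "Reflexion", "CRITIC"]  # replace with the final ✅ methods
INCLUDED_BENCHMARKS = [
    "HotpotQA", "FEVER", "TriviaQA", "AmbigNQ",      # Conversational/QA
    "GSM8k", "SVAMP", "TabMWP",                      # Mathematical reasoning
    "MBPP", "HumanEval",                             # Code generation
    "ALFWorld", "WebShop", "AgentBench",             # Decision-making
]

# One cell per (method, benchmark) pair; scores get filled in as runs finish.
grid = {(m, b): None for m, b in product(INCLUDED_METHODS, INCLUDED_BENCHMARKS)}
print(f"{len(grid)} method-benchmark cells to run")
```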

🤖 Models

✅ means I’ll include it, ⚠️ means I’m not sure, and 🚫 means I won’t include it (though it might still be useful to include its reported results).