| Research Question(s) | Stage 1: How do prompting-based agentic methods compare to fine-tuning-based agentic methods across a diverse set of benchmarks? Stage 2: How can we unite these two paradigms into a cohesive, novel, and performant fine-tuning agentic method? |
| --- | --- |
| Scope | Tags: AI, NLP, LLMs, LLM-based agent methods, prompting, fine-tuning |
Independent variables:
The prompting and fine-tuning methods listed below are tentative. For each, I've noted the method's code repository (if one exists) and whether I plan to include it in the study; a minimal sketch of a prompting-style agent loop follows the table.
✅ means I’ll include it, ⚠️ means not sure, and 🚫 means I won’t include it (though it might be useful to just include their results).
| Prompting method | Code? | Include? | Fine-tuning method | Code? | Include? |
| --- | --- | --- | --- | --- | --- |
| ReAct | ‣ | ✅ | FireAct | ‣ | ✅ |
| Reflexion | ‣ | ✅ | AgentTuning | ‣ | ✅ |
| CRITIC | ‣ | ✅ | Agent-FLAN | ‣ | ✅ |
| Self-Refine | ‣ | ✅ | AgentOptimizer | https://github.com/microsoft/autogen/blob/main/notebook/agentchat_agentoptimizer.ipynb | ✅ |
| Language Agent Tree Search (LATS) | ‣ | ✅ | REFINER | ‣ | ✅ |
| ExpeL | ‣ | ✅ | ReWOO | ‣ | ⚠️ |
| PreAct | ‣ | ⚠️ | LUMOS | ‣ | ⚠️ |
| AUTOACT | ‣ | ⚠️ | | | |
| RoT | ‣ | ⚠️ | PREFER | ‣ | ⚠️ |
| LearnAct | ‣ | ⚠️ | SwiftSage | ‣ | ⚠️ |
| Mirror | ‣ | ⚠️ | ETO | ‣ | ⚠️ |
| More Agents Is All You Need (MAIAYN) | https://anonymous.4open.science/r/more_agent_is_all_you_need | ⚠️ | Negative-Aware Training (NAT) | ‣ | ⚠️ |
| AnyTool | ‣ | ⚠️ | α-UMi | ‣ | ⚠️ |
| Self-Demos | ‣ | ⚠️ | Gorilla | ‣ | ⚠️ |
| Investigate-Consolidate-Exploit (ICE) | 🚫 | 🚫 | Toolformer | ‣ | ⚠️ |
| Self-Convince | 🚫 | 🚫 | ToolLLM | ‣ | ⚠️ |
| RankPrompt | 🚫 | 🚫 | ToRA | ‣ | ⚠️ |
| BAGEL | 🚫 | 🚫 | KnowAgent | ‣ | ⚠️ |
| AgentOhana | ‣ | ⚠️ | | | |
| ReST Meets ReAct | 🚫 | 🚫 | | | |
| DUMA | 🚫 | 🚫 | | | |
| CYCLE | 🚫 | 🚫 | | | |
| TMBR | 🚫 | 🚫 | | | |
| ART | 🚫 | 🚫 | | | |
| AMOR | 🚫 | 🚫 | | | |
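To make the prompting side of the comparison concrete, here is a minimal sketch of a ReAct-style Thought/Action/Observation loop. This is an illustration only, not any listed repository's implementation; `call_llm` and the tool callables are placeholders to be filled in.

```python
# Minimal ReAct-style prompting loop (illustrative sketch only).
# `call_llm` and the tools dict are placeholders, not any specific repo's API.
from typing import Callable, Dict

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (API or local model); plug in a real model here."""
    raise NotImplementedError

def react_loop(question: str,
               tools: Dict[str, Callable[[str], str]],
               max_steps: int = 8) -> str:
    """Alternate model 'Thoughts' with tool 'Actions' and fed-back 'Observations'
    until the model emits finish[answer], in the spirit of ReAct-style prompting."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # Ask the model for its next thought/action given the trajectory so far.
        step = call_llm(
            "Solve the task by alternating Thought / Action / Observation.\n"
            f"Available actions: {', '.join(tools)}, finish[answer]\n\n"
            f"{transcript}Thought:"
        )
        transcript += f"Thought:{step}\n"
        # Naive action parser: looks for tool_name[argument] or finish[answer].
        if "finish[" in step:
            return step.split("finish[", 1)[1].rsplit("]", 1)[0]
        for name, tool in tools.items():
            if f"{name}[" in step:
                arg = step.split(f"{name}[", 1)[1].rsplit("]", 1)[0]
                transcript += f"Observation: {tool(arg)}\n"  # feed the result back in
                break
    return "No final answer within the step budget."
```

The fine-tuning methods in the right-hand column, by contrast, generally train the model on trajectories of this kind rather than relying on prompt scaffolding alone.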
✅ means I’ll include it, ⚠️ means not sure, and 🚫 means I won’t include it (though it might be useful to just include their results).
\* means the benchmark is located in a repository and may not be a standard benchmark.
| Conversational/QA | Mathematical Reasoning | Code Generation | Decision-Making | Alignment | Commonsense Reasoning | Misc. |
| --- | --- | --- | --- | --- | --- | --- |
| HotpotQA ✅ | GSM8k ✅ | MBPP ✅ | ALFWorld ✅ | RealToxicityPrompts ⚠️ | CommonGen(-Hard) | AcronymGen* |
| FEVER ✅ | SVAMP ✅ | HumanEval ✅ | WebShop ✅ | HELM | | ToolBench |
| TriviaQA ✅ | TabMWP ✅ | LeetcodeHardGym | AgentBench ✅ | Moral Stories | | T-Eval |
| AmbigNQ ✅ | MATH ⚠️ | CodeNet | SciWorld | MINT | | |
| FED | MWPBench | PIE | MiniWoB++ | | | |
| Sentiment Reversal* | GSM1k | SWEBench | WebArena | | | |
| StrategyQA ⚠️ | ASDiv | InfiAgent-DABench | | | | |
| MMLU ⚠️ | GSM-Hard | HumanEval-XL | | | | |
| Bamboogle | | DS-1000 | | | | |
| MT-Bench | | MHPP | | | | |
| SOTUQA (ReWOO) | | | | | | |
| GAIA | | | | | | |
| BamTwoogle | | | | | | |
Only the ✅-marked methods are included, along with all of the benchmarks above.
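To make that study design concrete, here is a hypothetical sketch of the Stage 1 evaluation matrix: every included method is crossed with every benchmark. The method and benchmark lists are just the ✅ entries from the tables above; `evaluate` is a placeholder, not an existing harness.

```python
# Hypothetical sketch of the Stage 1 evaluation matrix (not an existing harness).
# Method and benchmark names are the ✅ entries from the tables above;
# `evaluate` must be wired up to each method's own code.
from typing import Dict, Iterable, Tuple

PROMPTING_METHODS = ["ReAct", "Reflexion", "CRITIC", "Self-Refine", "LATS", "ExpeL"]
FINE_TUNING_METHODS = ["FireAct", "AgentTuning", "Agent-FLAN", "AgentOptimizer", "REFINER"]
BENCHMARKS = ["HotpotQA", "FEVER", "TriviaQA", "AmbigNQ", "GSM8k", "SVAMP", "TabMWP",
              "MBPP", "HumanEval", "ALFWorld", "WebShop", "AgentBench"]

def evaluate(method: str, benchmark: str) -> float:
    """Placeholder: run `method` on `benchmark` and return its score."""
    raise NotImplementedError

def run_matrix(methods: Iterable[str],
               benchmarks: Iterable[str]) -> Dict[Tuple[str, str], float]:
    """Cross every method with every benchmark and collect one score per cell."""
    return {(m, b): evaluate(m, b) for m in methods for b in benchmarks}

# e.g. results = run_matrix(PROMPTING_METHODS + FINE_TUNING_METHODS, BENCHMARKS)
```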