| Research Question(s) | Stage 1: How do prompting-based agentic methods compare to fine-tuning-based agentic methods across a diverse set of benchmarks? Stage 2: How can we unite these two paradigms into a cohesive, novel, and performant fine-tuning agentic method? |
| --- | --- |
| Scope | Tags: AI, NLP, LLMs, LLM-based agent methods, prompting, fine-tuning |
Independent variables:
The prompting and fine-tuning methods listed below are tentative. For each method I've noted its code repository (if one exists) and whether I plan to include it in the study; a minimal sketch of the prompting paradigm follows the table.
✅ means I'll include it, ⚠️ means not sure, and 🚫 means I won't include it (though it might still be useful to report their published results).
| Prompting | Code? | Include? | Fine-tuning | Code? | Include? |
|---|---|---|---|---|---|
| ReAct | ‣ | ✅ | FireAct | ‣ | ✅ |
| Reflexion | ‣ | ✅ | AgentTuning | ‣ | ✅ |
| CRITIC | ‣ | ✅ | Agent-FLAN | ‣ | ✅ |
| Self-Refine | ‣ | ✅ | AgentOptimizer | https://github.com/microsoft/autogen/blob/main/notebook/agentchat_agentoptimizer.ipynb | ✅ |
| Language Agent Tree Search (LATS) | ‣ | ✅ | REFINER | ‣ | ✅ |
| ExpeL | ‣ | ✅ | ReWOO | ‣ | ⚠️ |
| PreAct | ‣ | ⚠️ | LUMOS | ‣ | ⚠️ |
| AUTOACT | ‣ | ⚠️ | | | |
| RoT | ‣ | ⚠️ | PREFER | ‣ | ⚠️ |
| LearnAct | ‣ | ⚠️ | SwiftSage | ‣ | ⚠️ |
| Mirror | ‣ | ⚠️ | ETO | ‣ | ⚠️ |
| More Agents Is All You Need (MAIAYN) | https://anonymous.4open.science/r/more_agent_is_all_you_need | ⚠️ | Negative-Aware Training (NAT) | ‣ | ⚠️ |
| AnyTool | ‣ | ⚠️ | α-UMi | ‣ | ⚠️ |
| Self-Demos | ‣ | ⚠️ | Gorilla | ‣ | ⚠️ |
| Investigate-Consolidate-Exploit (ICE) | 🚫 | 🚫 | Toolformer | ‣ | ⚠️ |
| Self-Convince | 🚫 | 🚫 | ToolLLM | ‣ | ⚠️ |
| RankPrompt | 🚫 | 🚫 | ToRA | ‣ | ⚠️ |
| BAGEL | 🚫 | 🚫 | KnowAgent | ‣ | ⚠️ |
| AgentOhana | ‣ | ⚠️ | | | |
| ReST Meets ReAct | 🚫 | 🚫 | | | |
| DUMA | 🚫 | 🚫 | | | |
| CYCLE | 🚫 | 🚫 | | | |
| TMBR | 🚫 | 🚫 | | | |
| ART | 🚫 | 🚫 | | | |
| AMOR | 🚫 | 🚫 | | | |
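To make the contrast between the two paradigms concrete, here is a minimal, hedged sketch of the prompting side: a ReAct-style think/act/observe loop. The `call_llm` stub, the `tools` dict, and the `Action: tool_name[tool_input]` output convention are placeholder assumptions for illustration, not any specific repository's implementation. Fine-tuning methods such as FireAct and AgentTuning keep roughly this interaction format but train the model on collected trajectories instead of relying on prompting alone.

```python
# Minimal ReAct-style loop (sketch). `call_llm` and the `tools` dict are
# placeholders standing in for whichever model and tool set a given
# benchmark provides; they are not part of any specific repository.
from typing import Callable, Dict

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call (API or local model)."""
    raise NotImplementedError

def react_episode(question: str, tools: Dict[str, Callable[[str], str]], max_steps: int = 8) -> str:
    trajectory = f"Question: {question}\n"
    for _ in range(max_steps):
        # The model is prompted to emit a Thought, then either an Action or a Final Answer.
        step = call_llm(trajectory + "Thought:")
        trajectory += f"Thought:{step}\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        if "Action:" in step:
            # Assumed convention here: "Action: tool_name[tool_input]"
            action = step.split("Action:", 1)[1].strip()
            name, _, arg = action.partition("[")
            observation = tools.get(name.strip(), lambda _: "Unknown tool")(arg.rstrip("]"))
            trajectory += f"Observation: {observation}\n"
    return trajectory  # fall back to the raw trajectory if no answer was produced
```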
Benchmarks:
The benchmarks below are grouped by task category; a sketch of the resulting Stage 1 evaluation grid follows the table. ✅ means I'll include it, ⚠️ means not sure, and 🚫 means I won't include it (though it might still be useful to report their published results).
*\* means the benchmark is located in a repository and may not be a standard benchmark.*
| Conversational/QA | Mathematical Reasoning | Code Generation | Decision-Making | Alignment | Commonsense Reasoning | Misc. |
|---|---|---|---|---|---|---|
| HotpotQA ✅ | GSM8k ✅ | MBPP ✅ | ALFWorld ✅ | RealToxicityPrompts ⚠️ | CommonGen(-Hard) | AcronymGen* |
| FEVER ✅ | SVAMP ✅ | HumanEval ✅ | WebShop ✅ | HELM | ToolBench | |
| TriviaQA ✅ | TabMWP ✅ | LeetcodeHardGym | AgentBench ✅ | Moral Stories | T-Eval | |
| AmbigNQ ✅ | MATH ⚠️ | CodeNet | SciWorld | MINT | | |
| FED | MWPBench | PIE | MiniWoB++ | | | |
| Sentiment Reversal* | GSM1k | SWEBench | WebArena | | | |
| StrategyQA ⚠️ | ASDiv | InfiAgent-DABench | | | | |
| MMLU ⚠️ | GSM-Hard | HumanEval-XL | | | | |
| Bamboogle | | DS-1000 | | | | |
| MT-Bench | | MHPP | | | | |
| SOTUQA (ReWOO) | | | | | | |
| GAIA | | | | | | |
| BamTwoogle | | | | | | |
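Given the included methods and benchmarks above, Stage 1 reduces to filling in a methods × benchmarks grid. The sketch below assumes a hypothetical `run_method` adapter around each method's own evaluation code; it is an outline of the experimental design, not a working harness.

```python
# Hedged sketch of the Stage 1 comparison: run every included method on every
# benchmark and collect one score per (method, benchmark) cell. The names
# mirror the ✅ rows in the tables above; `run_method` is a hypothetical
# stand-in for whatever each repository actually exposes.
from typing import Dict

PROMPTING = ["ReAct", "Reflexion", "CRITIC", "Self-Refine", "LATS", "ExpeL"]
FINE_TUNING = ["FireAct", "AgentTuning", "Agent-FLAN", "AgentOptimizer", "REFINER"]
BENCHMARKS = ["HotpotQA", "FEVER", "TriviaQA", "AmbigNQ", "GSM8k", "SVAMP",
              "TabMWP", "MBPP", "HumanEval", "ALFWorld", "WebShop", "AgentBench"]

def run_method(method: str, benchmark: str) -> float:
    """Hypothetical adapter: dispatch to the method's own repo/eval script
    and return its headline metric (accuracy, success rate, pass@1, ...)."""
    raise NotImplementedError

def stage1_grid() -> Dict[str, Dict[str, float]]:
    results: Dict[str, Dict[str, float]] = {}
    for method in PROMPTING + FINE_TUNING:
        results[method] = {bench: run_method(method, bench) for bench in BENCHMARKS}
    return results
```

In practice each benchmark reports a different headline metric (e.g., EM/F1 for HotpotQA, pass@1 for HumanEval, success rate for ALFWorld and WebShop), so the real grid would record the metric name alongside each score.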
✅ means I’ll include it, ⚠️ means not sure, and 🚫 means I won’t include it (though it might be useful to just include their results).
Only the methods marked ✅ are included, together with all of the benchmarks.
*\* means the benchmark is located in a repository and may not be a standard benchmark.*