| Research Question(s) | Stage 1: How do prompting-based agentic methods compare to fine-tuning-based agentic methods across a diverse set of benchmarks? Stage 2: How can we unite these two paradigms into a cohesive, novel, and performant fine-tuning agentic method? |
| --- | --- |
| Scope | Tags: AI, NLP, LLMs, LLM-based agent methods, prompting, fine-tuning |
Independent variables:
The prompting and fine-tuning methods listed below are tentative. For each method I've noted its code repository (if one exists) and whether I plan to include it in the study; a minimal sketch of the prompting paradigm follows the table.
✅ means I'll include it, ⚠️ means not sure, and 🚫 means I won't include it (though it might still be useful to report their published results).
| Prompting | Code? | Include? | Fine-tuning | Code? | Include? |
|---|---|---|---|---|---|
| ReAct | ‣ | ✅ | FireAct | ‣ | ✅ |
| Reflexion | ‣ | ✅ | AgentTuning | ‣ | ✅ |
| CRITIC | ‣ | ✅ | Agent-FLAN | ‣ | ✅ |
| Self-Refine | ‣ | ✅ | AgentOptimizer | https://github.com/microsoft/autogen/blob/main/notebook/agentchat_agentoptimizer.ipynb | ✅ |
| Language Agent Tree Search (LATS) | ‣ | ✅ | REFINER | ‣ | ✅ |
| ExpeL | ‣ | ✅ | ReWOO | ‣ | ⚠️ |
| PreAct | ‣ | ⚠️ | LUMOS | ‣ | ⚠️ |
| AUTOACT | ‣ | ⚠️ | | | |
| RoT | ‣ | ⚠️ | PREFER | ‣ | ⚠️ |
| LearnAct | ‣ | ⚠️ | SwiftSage | ‣ | ⚠️ |
| Mirror | ‣ | ⚠️ | ETO | ‣ | ⚠️ |
| More Agents Is All You Need (MAIAYN) | https://anonymous.4open.science/r/more_agent_is_all_you_need | ⚠️ | Negative-Aware Training (NAT) | ‣ | ⚠️ |
| AnyTool | ‣ | ⚠️ | α-UMi | ‣ | ⚠️ |
| Self-Demos | ‣ | ⚠️ | Gorilla | ‣ | ⚠️ |
| Investigate-Consolidate-Exploit (ICE) | 🚫 | 🚫 | Toolformer | ‣ | ⚠️ |
| Self-Convince | 🚫 | 🚫 | ToolLLM | ‣ | ⚠️ |
| RankPrompt | 🚫 | 🚫 | ToRA | ‣ | ⚠️ |
| BAGEL | 🚫 | 🚫 | KnowAgent | ‣ | ⚠️ |
| AgentOhana | ‣ | ⚠️ | | | |
| ReST Meets ReAct | 🚫 | 🚫 | | | |
| DUMA | 🚫 | 🚫 | | | |
| CYCLE | 🚫 | 🚫 | | | |
| TMBR | 🚫 | 🚫 | | | |
| ART | 🚫 | 🚫 | | | |
| AMOR | 🚫 | 🚫 | | | |
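To make the contrast between the two paradigms concrete, here is a minimal, hedged sketch of the prompting side: a ReAct-style think/act/observe loop. The `call_llm` stub, the `tools` dict, and the `Action: tool_name[tool_input]` output convention are placeholder assumptions for illustration, not any specific repository's implementation. Fine-tuning methods such as FireAct and AgentTuning keep roughly this interaction format but train the model on collected trajectories instead of relying on prompting alone.

```python
# Minimal ReAct-style loop (sketch). `call_llm` and the `tools` dict are
# placeholders standing in for whichever model and tool set a given
# benchmark provides; they are not part of any specific repository.
from typing import Callable, Dict

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call (API or local model)."""
    raise NotImplementedError

def react_episode(question: str, tools: Dict[str, Callable[[str], str]], max_steps: int = 8) -> str:
    trajectory = f"Question: {question}\n"
    for _ in range(max_steps):
        # The model is prompted to emit a Thought, then either an Action or a Final Answer.
        step = call_llm(trajectory + "Thought:")
        trajectory += f"Thought:{step}\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        if "Action:" in step:
            # Assumed convention here: "Action: tool_name[tool_input]"
            action = step.split("Action:", 1)[1].strip()
            name, _, arg = action.partition("[")
            observation = tools.get(name.strip(), lambda _: "Unknown tool")(arg.rstrip("]"))
            trajectory += f"Observation: {observation}\n"
    return trajectory  # fall back to the raw trajectory if no answer was produced
```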
Benchmarks:
The benchmarks below are grouped by task category; a sketch of the resulting Stage 1 evaluation grid follows the table. ✅ means I'll include it, ⚠️ means not sure, and 🚫 means I won't include it (though it might still be useful to report their published results).
*\* means the benchmark is located in a repository and may not be a standard benchmark.*
| Conversational/QA | Mathematical Reasoning | Code Generation | Decision-Making | Alignment | Commonsense Reasoning | Misc. |
|---|---|---|---|---|---|---|
| HotpotQA ✅ | GSM8k ✅ | MBPP ✅ | ALFWorld ✅ | RealToxicityPrompts ⚠️ | CommonGen(-Hard) | AcronymGen* |
| FEVER ✅ | SVAMP ✅ | HumanEval ✅ | WebShop ✅ | HELM | ToolBench | |
| TriviaQA ✅ | TabMWP ✅ | LeetcodeHardGym | AgentBench ✅ | Moral Stories | T-Eval | |
| AmbigNQ ✅ | MATH ⚠️ | CodeNet | SciWorld | MINT | | |
| FED | MWPBench | PIE | MiniWoB++ | | | |
| Sentiment Reversal* | GSM1k | SWEBench | WebArena | | | |
| StrategyQA ⚠️ | ASDiv | InfiAgent-DABench | | | | |
| MMLU ⚠️ | GSM-Hard | HumanEval-XL | | | | |
| Bamboogle | | DS-1000 | | | | |
| MT-Bench | | MHPP | | | | |
| SOTUQA (ReWOO) | | | | | | |
| GAIA | | | | | | |
| BamTwoogle | | | | | | |
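Given the included methods and benchmarks above, Stage 1 reduces to filling in a methods × benchmarks grid. The sketch below assumes a hypothetical `run_method` adapter around each method's own evaluation code; it is an outline of the experimental design, not a working harness.

```python
# Hedged sketch of the Stage 1 comparison: run every included method on every
# benchmark and collect one score per (method, benchmark) cell. The names
# mirror the ✅ rows in the tables above; `run_method` is a hypothetical
# stand-in for whatever each repository actually exposes.
from typing import Dict

PROMPTING = ["ReAct", "Reflexion", "CRITIC", "Self-Refine", "LATS", "ExpeL"]
FINE_TUNING = ["FireAct", "AgentTuning", "Agent-FLAN", "AgentOptimizer", "REFINER"]
BENCHMARKS = ["HotpotQA", "FEVER", "TriviaQA", "AmbigNQ", "GSM8k", "SVAMP",
              "TabMWP", "MBPP", "HumanEval", "ALFWorld", "WebShop", "AgentBench"]

def run_method(method: str, benchmark: str) -> float:
    """Hypothetical adapter: dispatch to the method's own repo/eval script
    and return its headline metric (accuracy, success rate, pass@1, ...)."""
    raise NotImplementedError

def stage1_grid() -> Dict[str, Dict[str, float]]:
    results: Dict[str, Dict[str, float]] = {}
    for method in PROMPTING + FINE_TUNING:
        results[method] = {bench: run_method(method, bench) for bench in BENCHMARKS}
    return results
```

In practice each benchmark reports a different headline metric (e.g., EM/F1 for HotpotQA, pass@1 for HumanEval, success rate for ALFWorld and WebShop), so the real grid would record the metric name alongside each score.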
✅ means I’ll include it, ⚠️ means not sure, and 🚫 means I won’t include it (though it might be useful to just include their results).
Only the methods marked ✅ are included, together with all of the benchmarks.
*\* means the benchmark is located in a repository and may not be a standard benchmark.*