From Prompts to Protocols: Benchmarking Multi-Agent AI Frameworks
As large language models (LLMs) become more powerful and versatile, the focus is rapidly shifting from single-model prompting to cooperative multi-agent workflows. Orchestration frameworks make these workflows possible by letting agents communicate, delegate tasks, and complete complex objectives in a structured, scalable way.
In this post, I benchmark three leading orchestration frameworks: CrewAI, MetaGPT, and LangGraph. Each is tested with three open-weight models served locally via Ollama: LLaMA 3, Mistral, and Phi-3. The goal is to evaluate how well each framework handles real-world task orchestration using a consistent, meaningful use case.
Whether you're a developer building intelligent assistants or an architect evaluating agent frameworks for production, this post will help you compare the tradeoffs clearly and practically.
Benchmark Use Case: Report Generation via Agents
To keep the evaluation realistic and reproducible, I used a single well-defined task across all frameworks:
Produce a list of sections for the report titled “The Future of Work in an AI World,” covering:
- Impact of AI on white‑collar jobs
- Future of remote work with AI
- AI and automation in blue‑collar sectors
- Policy and ethical challenges
This task was modeled using three collaborating agents:
- Planner: Defines the report structure
- Researcher: Generates content for each section
- Composer: Compiles all outputs into the final report
Each framework implemented the same Planner → Researcher → Composer flow.
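For reference, the sketch below shows that shared pipeline in framework-agnostic form, calling the local Ollama server directly through the official `ollama` Python client. The prompts and helper names (`ask`, `run_pipeline`) are illustrative assumptions, not the exact code used inside any of the three frameworks.

```python
# Framework-agnostic sketch of the Planner -> Researcher -> Composer flow.
# Assumes a local Ollama server and the official `ollama` Python client
# (pip install ollama). Prompts and helper names are illustrative only.
import ollama

TOPIC = "The Future of Work in an AI World"
FOCUS_AREAS = [
    "Impact of AI on white-collar jobs",
    "Future of remote work with AI",
    "AI and automation in blue-collar sectors",
    "Policy and ethical challenges",
]


def ask(model: str, role: str, prompt: str) -> str:
    """Send one agent turn to the local Ollama server and return the reply text."""
    response = ollama.chat(
        model=model,
        messages=[
            {"role": "system", "content": f"You are the {role} agent."},
            {"role": "user", "content": prompt},
        ],
    )
    return response["message"]["content"]


def run_pipeline(model: str) -> str:
    # Planner: defines the report structure.
    outline = ask(
        model,
        "Planner",
        f"Produce a list of sections for a report titled '{TOPIC}', "
        f"covering: {'; '.join(FOCUS_AREAS)}.",
    )
    # Researcher: generates content for each planned section.
    research = ask(
        model,
        "Researcher",
        f"Write a short, insightful paragraph for each section in this outline:\n{outline}",
    )
    # Composer: compiles the outline and drafts into the final report.
    return ask(
        model,
        "Composer",
        "Compile the outline and section drafts below into a coherent report.\n\n"
        f"Outline:\n{outline}\n\nDrafts:\n{research}",
    )


if __name__ == "__main__":
    print(run_pipeline("llama3"))
```

Each framework expresses the same three roles with its own abstractions (CrewAI agents and tasks, MetaGPT roles, LangGraph nodes), but the handoff pattern is identical.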
Benchmark Setup
- Hardware: NVIDIA RTX 3060 (12 GB VRAM), 64 GB RAM, Ubuntu
- Models via Ollama: LLaMA 3, Mistral, Phi-3
- Task type: Multi-agent collaboration
- Evaluation Criteria: Structure, insight, flow, collaboration, creativity
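Because every framework talks to the same local Ollama endpoint, swapping models is just a matter of changing the model tag. As a rough illustration, reusing the hypothetical `run_pipeline` from the sketch above, the sweep looks like this:

```python
# Illustrative sweep over the three Ollama model tags, reusing the hypothetical
# run_pipeline() from the earlier sketch. Assumes the models have already been
# pulled locally (e.g. `ollama pull llama3`).
MODELS = ["llama3", "mistral", "phi3"]

for model in MODELS:
    report = run_pipeline(model)
    # Each report was then evaluated against the five criteria listed above.
    print(f"=== {model} ===\n{report}\n")
```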
Benchmark Summary Table
Each pairing of framework and model was scored from 1.0 to 5.0 on each of the five criteria, with decimal precision for added nuance.
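The scoring structure is simple to represent. The sketch below is an assumed layout (not the benchmark's actual rubric code): the five criteria on a 1.0–5.0 scale, with an unweighted mean as the overall score; the numbers in the example are hypothetical.

```python
# Assumed scoring structure: five criteria on a 1.0-5.0 scale, with an
# unweighted mean as the overall score. The benchmark's exact weighting may
# differ; this is an illustrative sketch with hypothetical numbers.
from dataclasses import dataclass, fields


@dataclass
class Score:
    structure: float
    insight: float
    flow: float
    collaboration: float
    creativity: float

    @property
    def overall(self) -> float:
        values = [getattr(self, f.name) for f in fields(self)]
        return round(sum(values) / len(values), 2)


# Example: one hypothetical run scored across the five dimensions.
example = Score(structure=4.5, insight=4.0, flow=4.2, collaboration=4.3, creativity=3.8)
print(example.overall)  # 4.16
```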
Key Findings
- MetaGPT + Phi-3: Highest overall score. Strong structure, original insights, and smooth collaboration across agents. Excellent planning and execution flow.
- CrewAI + Mistral: Very readable and structured. Slightly less creative but performed well across all criteria.
- LangGraph + Mistral: Consistent and accurate. Slightly rigid in voice, but high marks for flow and factual accuracy.
- CrewAI + LLaMA 3: Underperformed, with weak handoffs between agents and placeholder-like output, likely reflecting prompt-alignment issues or framework limitations in this particular run.
Conclusion
All three frameworks can support multi-agent orchestration, but your ideal choice depends on the use case:
- CrewAI: Best for rapid prototyping and simple chains
- MetaGPT: Excels at clean logic and modular workflows
- LangGraph: Offers explicit, graph-based flow control for more complex or stateful tasks
Model choice also matters:
- Phi-3: Most balanced across all runs
- Mistral: Fast, with consistently readable and accurate output
- LLaMA 3: Good reasoning when agents were aligned properly
GitHub Repository: https://github.com/Algocrat/multi-agent-benchmark