From Prompts to Protocols: Benchmarking Multi-Agent AI Frameworks
As large language models (LLMs) become more powerful and versatile, the focus is rapidly shifting from single-model prompting to cooperative multi-agent workflows. Orchestration frameworks make these workflows possible by letting agents communicate, delegate tasks, and complete complex objectives in a structured, scalable way.
In this post, I benchmark three leading orchestration frameworks: CrewAI, MetaGPT, and LangGraph. Each is tested with three open-weight models served locally via Ollama: LLaMA 3, Mistral, and Phi-3. The goal is to evaluate how well each framework handles real-world task orchestration using a consistent, meaningful use case.
Whether you're a developer building intelligent assistants or an architect evaluating agent frameworks for production, this post will help you compare the tradeoffs clearly and practically.
Benchmark Use Case: Report Generation via Agents
To keep the evaluation realistic and reproducible, I used a single well-defined task across all frameworks:
Produce a list of sections for the report titled “The Future of Work in an AI World,” covering:
- Impact of AI on white‑collar jobs
- Future of remote work with AI
- AI and automation in blue‑collar sectors
- Policy and ethical challenges
This task was modeled using three collaborating agents:
- Planner: Defines the report structure
- Researcher: Generates content for each section
- Composer: Compiles all outputs into the final report
Each framework implemented the same Planner → Researcher → Composer flow.
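For reference, the sketch below shows that shared pipeline in framework-agnostic form, calling the local Ollama server directly through the official `ollama` Python client. The prompts and helper names (`ask`, `run_pipeline`) are illustrative assumptions, not the exact code used inside any of the three frameworks.

```python
# Framework-agnostic sketch of the Planner -> Researcher -> Composer flow.
# Assumes a local Ollama server and the official `ollama` Python client
# (pip install ollama). Prompts and helper names are illustrative only.
import ollama

TOPIC = "The Future of Work in an AI World"
FOCUS_AREAS = [
    "Impact of AI on white-collar jobs",
    "Future of remote work with AI",
    "AI and automation in blue-collar sectors",
    "Policy and ethical challenges",
]


def ask(model: str, role: str, prompt: str) -> str:
    """Send one agent turn to the local Ollama server and return the reply text."""
    response = ollama.chat(
        model=model,
        messages=[
            {"role": "system", "content": f"You are the {role} agent."},
            {"role": "user", "content": prompt},
        ],
    )
    return response["message"]["content"]


def run_pipeline(model: str) -> str:
    # Planner: defines the report structure.
    outline = ask(
        model,
        "Planner",
        f"Produce a list of sections for a report titled '{TOPIC}', "
        f"covering: {'; '.join(FOCUS_AREAS)}.",
    )
    # Researcher: generates content for each planned section.
    research = ask(
        model,
        "Researcher",
        f"Write a short, insightful paragraph for each section in this outline:\n{outline}",
    )
    # Composer: compiles the outline and drafts into the final report.
    return ask(
        model,
        "Composer",
        "Compile the outline and section drafts below into a coherent report.\n\n"
        f"Outline:\n{outline}\n\nDrafts:\n{research}",
    )


if __name__ == "__main__":
    print(run_pipeline("llama3"))
```

Each framework expresses the same three roles with its own abstractions (CrewAI agents and tasks, MetaGPT roles, LangGraph nodes), but the handoff pattern is identical.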
Benchmark Setup
- Hardware: NVIDIA RTX 3060 (12 GB VRAM), 64 GB RAM, Ubuntu
- Models via Ollama: LLaMA 3, Mistral, Phi-3
- Task type: Multi-agent collaboration
- Evaluation Criteria: Structure, insight, flow, collaboration, creativity
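Because every framework talks to the same local Ollama endpoint, swapping models is just a matter of changing the model tag. As a rough illustration, reusing the hypothetical `run_pipeline` from the sketch above, the sweep looks like this:

```python
# Illustrative sweep over the three Ollama model tags, reusing the hypothetical
# run_pipeline() from the earlier sketch. Assumes the models have already been
# pulled locally (e.g. `ollama pull llama3`).
MODELS = ["llama3", "mistral", "phi3"]

for model in MODELS:
    report = run_pipeline(model)
    # Each report was then evaluated against the five criteria listed above.
    print(f"=== {model} ===\n{report}\n")
```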
Benchmark Summary Table
Each pairing of framework and model was scored from 1.0 to 5.0 on each of the five criteria, with decimal precision for added nuance.
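The scoring structure is simple to represent. The sketch below is an assumed layout (not the benchmark's actual rubric code): the five criteria on a 1.0–5.0 scale, with an unweighted mean as the overall score; the numbers in the example are hypothetical.

```python
# Assumed scoring structure: five criteria on a 1.0-5.0 scale, with an
# unweighted mean as the overall score. The benchmark's exact weighting may
# differ; this is an illustrative sketch with hypothetical numbers.
from dataclasses import dataclass, fields


@dataclass
class Score:
    structure: float
    insight: float
    flow: float
    collaboration: float
    creativity: float

    @property
    def overall(self) -> float:
        values = [getattr(self, f.name) for f in fields(self)]
        return round(sum(values) / len(values), 2)


# Example: one hypothetical run scored across the five dimensions.
example = Score(structure=4.5, insight=4.0, flow=4.2, collaboration=4.3, creativity=3.8)
print(example.overall)  # 4.16
```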
Key Findings
- MetaGPT + Phi-3: Highest overall score. Strong structure, original insights, and smooth collaboration across agents. Excellent planning and execution flow.
- CrewAI + Mistral: Very readable and structured. Slightly less creative but performed well across all criteria.
- LangGraph + Mistral: Consistent and accurate. Slightly rigid in voice, but high marks for flow and factual accuracy.
- CrewAI + LLaMA 3: Underperformed, with weak handoffs between agents and placeholder-like output, likely reflecting prompt-alignment issues or framework limitations in this particular run.
Conclusion
All three frameworks can support multi-agent orchestration, but your ideal choice depends on the use case:
- CrewAI: Best for rapid prototyping and simple chains
- MetaGPT: Excels at clean logic and modular workflows
- LangGraph: Offers explicit, graph-based flow control for more complex or stateful tasks
Model choice also matters:
- Phi-3: Most balanced across all runs
- Mistral: Fast, with consistently readable and accurate output
- LLaMA 3: Good reasoning when agents were aligned properly
GitHub Repository: https://github.com/Algocrat/multi-agent-benchmark