From Prompts to Protocols: Benchmarking Multi-Agent AI Frameworks

As large language models (LLMs) grow more powerful and versatile, the focus is rapidly shifting from single-model prompting to cooperative multi-agent workflows. Orchestration frameworks let agents communicate, delegate tasks, and complete complex objectives in a structured, scalable way.

In this post, I benchmark three leading orchestration frameworks: CrewAI, MetaGPT, and LangGraph. Each is paired with three open-weight models served locally through Ollama: LLaMA 3, Mistral, and Phi-3. The goal is to evaluate how well each framework handles real-world task orchestration using one consistent, meaningful use case.

Whether you're a developer building intelligent assistants or an architect evaluating agent frameworks for production, this post will help you compare the tradeoffs clearly and practically.

Benchmark Use Case: Report Generation via Agents

To keep the evaluation realistic and reproducible, I used a single well-defined task across all frameworks:

Produce a report titled “The Future of Work in an AI World,” with sections covering:

  • Impact of AI on white‑collar jobs
  • Future of remote work with AI
  • AI and automation in blue‑collar sectors
  • Policy and ethical challenges

This task was modeled using three collaborating agents:

  • Planner: Defines the report structure
  • Researcher: Generates content for each section
  • Composer: Compiles all outputs into the final report

Each framework implemented the same Planner → Researcher → Composer flow.
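
Before looking at each framework, here is a minimal, framework-agnostic sketch of that handoff in Python, calling Ollama's local REST API (/api/generate) directly with the requests library. The helper names (ask, run_pipeline) and the prompts are illustrative assumptions, not the code used in the benchmark runs.

    import requests

    OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint


    def ask(model: str, prompt: str) -> str:
        """Send one prompt to a local Ollama model and return the generated text."""
        payload = {"model": model, "prompt": prompt, "stream": False}
        resp = requests.post(OLLAMA_URL, json=payload, timeout=300)
        resp.raise_for_status()
        return resp.json()["response"]


    def run_pipeline(model: str, topic: str) -> str:
        """Planner -> Researcher -> Composer, each agent consuming the previous output."""
        # Planner: define the report structure
        outline = ask(model, f"Plan a section outline for a report titled '{topic}'. "
                             "Return one section title per line, nothing else.")

        # Researcher: generate content for each planned section
        sections = []
        for title in (line.strip("-• ").strip() for line in outline.splitlines() if line.strip()):
            sections.append(ask(model, f"Write a concise section titled '{title}' "
                                       f"for a report on '{topic}'."))

        # Composer: compile all outputs into the final report
        draft = "\n\n".join(sections)
        return ask(model, f"Compile these sections into a cohesive report titled '{topic}', "
                          f"adding a brief introduction and conclusion:\n\n{draft}")


    if __name__ == "__main__":
        print(run_pipeline("llama3", "The Future of Work in an AI World"))

Each framework expresses this same chain in its own idiom (a CrewAI crew, a MetaGPT team, a LangGraph graph), which is exactly what the benchmark compares.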


Benchmark Setup

  • Hardware: NVIDIA RTX 3060 (12 GB), 64 GB RAM, Ubuntu
  • Models (via Ollama): LLaMA 3, Mistral, Phi-3
  • Task type: Multi-agent collaboration
  • Evaluation criteria: Structure, insight, flow, collaboration, creativity
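
Operationally, the benchmark is just a sweep over every framework/model pairing on the same task. The sketch below shows the shape of that sweep; run_fn stands in for a hypothetical per-framework adapter that wires the chosen model into the chosen framework, and the constant names are my own.

    import itertools

    FRAMEWORKS = ["crewai", "metagpt", "langgraph"]
    MODELS = ["llama3", "mistral", "phi3"]   # Ollama model tags
    TOPIC = "The Future of Work in an AI World"


    def run_benchmark(run_fn):
        """Run every framework/model pairing once and collect the generated reports.

        run_fn(framework, model, topic) is a placeholder for the adapter that
        executes the three-agent flow in the given framework with the given model.
        """
        reports = {}
        for framework, model in itertools.product(FRAMEWORKS, MODELS):
            reports[(framework, model)] = run_fn(framework, model, TOPIC)
        return reports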

Benchmark Summary Table

Each framework–model pairing was scored from 1.0 to 5.0 on every dimension, with decimal precision for nuance.
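
For reference, the overall score quoted in the findings below is assumed here to be the simple mean of the five dimension scores; the numbers in this snippet are placeholders to show the arithmetic, not actual benchmark results.

    from statistics import mean

    CRITERIA = ("structure", "insight", "flow", "collaboration", "creativity")


    def overall_score(scores: dict) -> float:
        """Average the five per-dimension scores (each 1.0-5.0) into one overall value."""
        return round(mean(scores[c] for c in CRITERIA), 2)


    # Placeholder numbers only -- not the actual benchmark scores.
    print(overall_score({"structure": 4.5, "insight": 4.0, "flow": 4.5,
                         "collaboration": 4.0, "creativity": 3.5}))  # 4.1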

Key Findings

  • MetaGPT + Phi-3: Highest overall score. Strong structure, original insights, and smooth collaboration across agents. Excellent planning and execution flow.
  • CrewAI + Mistral: Very readable and structured. Slightly less creative but performed well across all criteria.
  • LangGraph + Mistral: Consistent and accurate. Slightly rigid in voice, but high marks for flow and factual accuracy.
  • CrewAI + LLaMA 3: Underperformed due to weak handoff between agents and placeholder-like output. Reflects either prompt alignment issues or framework limitations in this run.

Conclusion

All three frameworks can support multi-agent orchestration, but your ideal choice depends on the use case:

  • CrewAI: Best for rapid prototyping and simple chains
  • MetaGPT: Excels at clean logic and modular workflows
  • LangGraph: Offers flow control for more complex or stateful tasks

Model choice also matters:

  • Phi-3: Most balanced across all runs
  • Mistral: Fast, with concise and well-structured output
  • LLaMA 3: Solid reasoning, but only when the agent prompts were properly aligned

GitHub Repository: https://github.com/Algocrat/multi-agent-benchmark
