AI coding agents are everywhere. They write your functions, scaffold your database schemas, and wire up authentication flows. But here's a question nobody has answered: which AI model actually knows Appwrite best?
We built Appwrite Arena to find out.
What is Arena?
Appwrite Arena is an open-source benchmark that evaluates how well large language models understand Appwrite. It tests models across real-world usage scenarios, covering everything from Auth and Databases to Functions, Storage, Sites, and more.
Arena doesn't just measure general coding ability. It measures Appwrite-specific knowledge: correct SDK usage, accurate API patterns, and proper service configuration. The kind of knowledge that determines whether your AI agent generates working Appwrite code or something that looks right but breaks at runtime.
All questions, answers, and scores are fully open source and available on GitHub.
How it works
Arena evaluates each model using 191 questions spanning 9 Appwrite service categories:
- Foundation
- Auth
- Databases
- Functions
- Storage
- Sites
- Messaging
- Realtime
- CLI
Questions are drawn from real Appwrite platform usage: the same kinds of tasks developers encounter daily when building with the platform.
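For a concrete picture, here's a hypothetical sketch of what a question record could look like. The field names below are illustrative, not Arena's actual schema; the real question files live in the GitHub repository.

```ts
// Hypothetical shape of an Arena question record (illustrative only;
// see the GitHub repository for the actual schema).
type ServiceCategory =
  | "Foundation" | "Auth" | "Databases" | "Functions" | "Storage"
  | "Sites" | "Messaging" | "Realtime" | "CLI";

interface McqQuestion {
  id: string;
  category: ServiceCategory;
  kind: "mcq";
  prompt: string;      // e.g. "Which SDK call creates an email/password session?"
  choices: string[];   // candidate answers, exactly one correct
  answerIndex: number; // index of the correct choice
}

interface OpenEndedQuestion {
  id: string;
  category: ServiceCategory;
  kind: "open";
  prompt: string;
  rubric: string;          // criteria the AI judge scores against
  referenceAnswer: string; // reference used alongside the rubric
}

type ArenaQuestion = McqQuestion | OpenEndedQuestion;
```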
With and without Skills
Each model is tested in two contexts:
- Without Skills: The model answers using only its built-in training data.
- With Skills: The model answers with access to Appwrite's Skills files, which provide up-to-date SDK and API context.
This comparison reveals something important: how much a model improves when given the right documentation. Some models see dramatic improvements. Others barely move, and the extra context only bloats the prompt. That gap tells you a lot about which models are best at leveraging context and which rely more on memorized patterns.
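In sketch form, the dual-context run looks something like the code below. `askModel` and `skillsContext` are assumed placeholder names, not Arena's actual internals.

```ts
// Minimal sketch of a with/without-Skills comparison.
// `askModel` and `skillsContext` are assumed helpers, not Arena's real API.
async function evaluateBothContexts(
  askModel: (prompt: string) => Promise<string>,
  questions: { prompt: string; isCorrect: (answer: string) => boolean }[],
  skillsContext: string,
) {
  let withoutSkills = 0;
  let withSkills = 0;

  for (const q of questions) {
    // Run 1: the model answers from its training data alone.
    if (q.isCorrect(await askModel(q.prompt))) withoutSkills++;

    // Run 2: the same question, with Skills files prepended as context.
    if (q.isCorrect(await askModel(`${skillsContext}\n\n${q.prompt}`))) withSkills++;
  }

  // The gap between the two accuracies is the "Skills lift".
  return {
    withoutSkills: withoutSkills / questions.length,
    withSkills: withSkills / questions.length,
  };
}
```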
Scoring
Arena uses two complementary scoring methods to give you a complete picture.
Deterministic (MCQ)
165 multiple-choice questions with a single correct answer. These scores are fully reproducible with no judge bias, giving you a reliable baseline for comparing models.
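Because every MCQ has a fixed answer key, scoring reduces to exact-match accuracy. Here's a minimal sketch, assuming responses arrive as chosen answer indices; Arena's actual implementation may differ.

```ts
// Deterministic MCQ scoring: exact match against the answer key.
// Reproducible by construction; no judge involved.
function scoreMcq(
  responses: { questionId: string; chosenIndex: number }[],
  answerKey: Map<string, number>, // questionId -> correct choice index
): number {
  let correct = 0;
  for (const r of responses) {
    if (answerKey.get(r.questionId) === r.chosenIndex) correct++;
  }
  return correct / responses.length; // accuracy in [0, 1]
}
```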
AI-judged (open-ended)
26 open-ended questions scored from 0 to 1 by an AI judge using rubrics and reference answers. These test reasoning and real-world usage patterns that multiple-choice questions can't capture. Scores may have slight variance due to the nature of AI-based evaluation.
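A rubric-based judge can be sketched roughly as follows, where `judgeModel` is an assumed helper and the prompt format is illustrative rather than Arena's actual rubric prompt.

```ts
// Sketch of rubric-based judging for one open-ended question.
// `judgeModel` is an assumed helper that returns the judge's raw text reply.
async function judgeOpenEnded(
  judgeModel: (prompt: string) => Promise<string>,
  question: { prompt: string; rubric: string; referenceAnswer: string },
  candidateAnswer: string,
): Promise<number> {
  const prompt = [
    "Score the candidate answer from 0 to 1 against the rubric.",
    `Question: ${question.prompt}`,
    `Rubric: ${question.rubric}`,
    `Reference answer: ${question.referenceAnswer}`,
    `Candidate answer: ${candidateAnswer}`,
    "Reply with only a number between 0 and 1.",
  ].join("\n\n");

  const score = Number.parseFloat((await judgeModel(prompt)).trim());

  // Clamp to [0, 1] in case the judge replies out of range.
  return Number.isNaN(score) ? 0 : Math.min(1, Math.max(0, score));
}
```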
The combination of both methods ensures that Arena measures not just factual recall but also a model's ability to reason about Appwrite in practice.
Why this matters
If you're using AI agents to build with Appwrite, the model you choose directly affects your productivity. A model that understands Appwrite means fewer hallucinated method calls, fewer trips to the docs, and less time debugging AI-generated code.
Arena gives you the data to make that choice. Instead of guessing which model works best, you can see exactly how each one performs across every Appwrite service.
Cost vs. performance. The best-in-class models are expensive. Arena helps you answer whether a top-tier model is actually worth the price for your project, or whether a model that's significantly cheaper or faster gets you close enough.
Always up to date. Because Arena is a benchmark, we can rerun it whenever new models or updates drop. This isn't a one-time comparison. It's a living source of truth you can come back to anytime you're evaluating models.
Response duration matters. Two models with similar token throughput can have very different benchmark durations. A slower run often means the model is spending more tokens to reach the same answer, which translates to a slower developer experience in practice.
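As a back-of-the-envelope illustration (the numbers below are made up): at identical throughput, a model that spends twice the tokens takes twice the wall-clock time.

```ts
// Duration estimate at equal throughput (illustrative numbers only).
const tokensPerSecond = 100;    // same throughput for both models
const conciseRunTokens = 4_000; // hypothetical token spend per run
const verboseRunTokens = 8_000;

console.log(conciseRunTokens / tokensPerSecond); // ~40 s
console.log(verboseRunTokens / tokensPerSecond); // ~80 s: same speed, twice the wait
```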
And if you've installed Appwrite Skills, Arena shows you exactly how much they improve your model's performance, so you can decide whether Skills are worth adding to your workflow.
Open source, fully transparent
Every part of Arena is open source: the questions, the reference answers, the scoring rubrics, and the results. You can verify any score, reproduce any benchmark run, and see exactly how each model was evaluated.
You can also contribute. If you think a question is missing, a rubric could be improved, or a new model should be added, open a PR on the Arena GitHub repository.
Arena is already receiving community contributions. Thank you to Abhi Varde for submitting a bug fix before launch.
Early results
Here are some highlights from the first round of benchmarks:
- GPT-5.4 currently ranks as the best model with Skills, while Claude Opus 4.6 leads without Skills.
- GPT-5.3 Codex used surprisingly few tokens compared to other models, roughly 50% fewer.
- Open-source models like DeepSeek and MiniMax offer the best balance between intelligence and cost.
- Skills have a massive impact on MiniMax, making it one of the biggest improvers with added context.
- Most recent models tend to perform best on Databases questions.
Head to arena.appwrite.io to explore the full leaderboard, compare models side by side, and find the best fit for your Appwrite development workflow.



