The Challenge: Evaluating Nuanced AI Performance

To optimize its Command A model, Cohere needed to understand how well it performed in enterprise scenarios such as customer service and HR queries. They weren't looking only for accurate responses, but for nuance: did the model understand tone, context, and ambiguity? Answering that required smart, consistent, scalable human evaluations.

The Solution: A Trusted Partnership for Quality Data

Having previously partnered with espirai to train its Command R model for hallucination reduction, Cohere already considered espirai a valued partner that helps them win in the marketplace. The espirai team is passionate about improving models. Cohere has trusted espirai with critical challenges, and our commitment to quality ensures they can develop a best-in-class model.

How did quality improve with espirai?

We maintain a high bar for talent, with continuous observability that ensures our clients can trust the data. We aren't afraid to challenge our clients, posing complex questions that push them to create better data.

The result: Command A matches, and in some cases substantially exceeds, its competitors at consistently answering in the requested language. Take Arabic dialects, for example: its ADI2 score (a human evaluation metric) showed a 9-point lead over GPT-4o and DeepSeek-V2.

The Commercial Impact

In head-to-head human evaluations across business, STEM, and coding tasks, Command A matches or outperforms its larger competitors while offering superior throughput and efficiency. It excels at business-critical agentic and multilingual tasks while being deployable on just two GPUs, compared with other models that typically require as many as 32. Human evaluations drove this success because they test on real-world enterprise data and situations.

Head-to-head human evaluation win rates on enterprise tasks. All examples are blind-annotated by espirai human annotators, assessing enterprise-focused accuracy, instruction following, and style. Throughput comparisons are between Command A on the Cohere platform, GPT-4o, and DeepSeek-V2 (TogetherAI), as reported by Artificial Analysis.