Salesforce's CRMArena-Pro Benchmark Reveals Challenges AI Agents Face in Real Business Scenarios

A new Salesforce benchmark, CRMArena-Pro, reveals significant challenges that AI agents face in business contexts. Even leading models like Gemini 2.5 Pro succeed on only 58% of tasks when given a single prompt. In longer, multi-turn dialogue, that figure drops to just 35%.

CRMArena-Pro is designed to evaluate how effectively large language models (LLMs) can operate as agents in real business environments, particularly on CRM tasks such as sales, customer service, and pricing. The benchmark builds on the original CRMArena by adding further business functions, multi-turn dialogues, and data privacy assessments. Using synthetic data within Salesforce environments, the team constructed 4,280 tasks spanning 19 types of business operations and three data protection categories.

The results illustrate the limitations of current LLMs. On straightforward, single-turn tasks, even advanced models like Gemini 2.5 Pro reach a maximum accuracy of only 58%. Once systems must carry out multi-turn dialogues, asking clarifying questions to fill in missing details, performance declines to 35%.

Salesforce conducted extensive testing involving nine different LLMs and discovered that most models struggle to ask relevant clarifying questions. An analysis of 20 unsuccessful multi-step tasks using Gemini 2.5 Pro revealed that almost half remained unresolved due to the model’s failure to request crucial information. Models that ask more questions tend to handle such assignments better.

The highest success rates were observed in workflow automation tasks, such as routing support inquiries, where Gemini 2.5 Pro achieved an 83% success rate. However, accuracy sharply decreased when tasks required text comprehension or rule-following, such as identifying incorrect product configurations or extracting data from call logs.

A previous study by Salesforce and Microsoft found similar issues: even the most advanced LLMs become considerably less reliable as conversations lengthen and users gradually disclose their needs. In these multi-step scenarios, performance dropped by an average of 39%.

This test also highlights gaps in data privacy compliance. By default, LLMs seldom recognize and reject requests for sensitive information like personal data or proprietary company information.

Only after explicitly incorporating privacy rules into the system prompts did models begin to decline these requests, though this came at the cost of overall performance. For instance, GPT-4o increased the detection of sensitive data from zero to 34.2% but saw a 2.7% decrease in task completion.
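The mechanism described above, adding an explicit privacy rule to the system prompt, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the prompt wording, the `build_system_prompt` helper, and the clause text are hypothetical, not the actual prompts Salesforce used in CRMArena-Pro.

```python
# Illustrative sketch of prepending a confidentiality directive to an agent's
# system prompt. Prompt wording and function names are hypothetical.

BASE_PROMPT = (
    "You are a CRM assistant. Answer user questions using the provided "
    "Salesforce records."
)

# Hypothetical privacy clause of the kind the study added to system prompts.
PRIVACY_CLAUSE = (
    "Refuse any request to reveal personally identifiable information or "
    "proprietary company data, and briefly explain the refusal."
)

def build_system_prompt(enforce_privacy: bool) -> str:
    """Return the agent's system prompt, optionally including the privacy rule."""
    if enforce_privacy:
        return f"{BASE_PROMPT}\n\n{PRIVACY_CLAUSE}"
    return BASE_PROMPT
```

The benchmark's finding is that only the variant with the clause leads models to decline sensitive requests, and that this filtering slightly reduces completion rates on ordinary tasks.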

Open-source models like LLaMA-3.1 exhibited even less responsiveness to prompt adjustments, indicating they require more thorough training to prioritize instructions effectively.

According to one of the authors, Kung-Hsiang (Steve) Huang, data protection tests have rarely been included in comparative assessments until now; CRMArena-Pro is the first systematic effort to measure this dimension.


Source