How Short Queries and Assertive Tone Increase Hallucinations in Language Models

According to a new study, many language models are more likely to produce inaccurate information when users request concise responses.

Researchers from Giskard evaluated leading language models using a multilingual benchmark test called Phare, focusing on how frequently these models «hallucinate.» The first release of the benchmark test addresses hallucinations—a problem that, as indicated by previous studies, accounts for over a third of all documented incidents involving large language models.

The findings reveal a clear pattern: many models are more inclined to generate hallucinations when users request brief answers or phrase their questions with excessive confidence.

Tasks that explicitly request brevity, such as «Answer briefly,» can compromise the factual accuracy of several models. In some instances, the robustness against hallucinations dropped by as much as 20 percent.

According to Phare testing, this decline is largely due to the fact that precise refutations often require longer and more detailed explanations. When models are forced to condense their responses—frequently to reduce token usage or wait times—they are more likely to sacrifice actual accuracy.

Some models were affected more severely than others. Grok 2, Deepseek V3, and GPT-4o mini exhibited a significant drop in performance under length constraints. Conversely, other models, such as Claude 3.7 Sonnet, Claude 3.5 Sonnet, and Gemini 1.5 Pro, remained relatively stable even when asked to respond succinctly.

The tone of user prompts also plays a critical role. Phrases like «I’m 100% sure that…» or «My teacher told me that…» make it less likely for certain models to correct false statements. This so-called flattery effect can decrease a model’s ability to challenge incorrect claims by as much as 15 percent.

«Models primarily optimized for user satisfaction consistently provide information that sounds plausible and authoritative, despite the dubious or nonexistent factual basis,» the study notes.

Smaller models, like GPT-4o mini, Qwen 2.5 Max, and Gemma 3 27B, are particularly sensitive to such user phrasing. In contrast, larger models from Anthropic and Meta*, including Claude 3.5, Claude 3.7, and Llama 4 Maverick, are much less affected by exaggerated user confidence.

The research also indicates that language models are likely to perform worse in real-world situations, such as manipulative phrasing or systemic constraints, than in idealized testing environments. This issue becomes especially concerning when applications prioritize brevity and user convenience over factual reliability.

Phare is a collaborative project involving Giskard, Google DeepMind, the European Union, and Bpifrance. Its goal is to create a comprehensive benchmark for assessing the safety and reliability of large language models. Future modules will investigate biases, harmfulness, and vulnerabilities to misuse.

Full results are available at phare.giskard.ai, where organizations can participate in further development. AI model testing can be conducted on the BotHub platform, which does not require a VPN for access. An invitation link offers 100,000 free tokens for initial tasks, allowing users to get started immediately.

Source