Llama 4 Models: Strong Performance in Standard Tests but Struggles with Complex Long-Context Tasks

Recent independent evaluations show that Meta's latest Llama 4 models, Maverick and Scout, perform well on standard benchmarks but struggle with complex tasks that require long context.

According to the aggregate "Intelligence Index" from Artificial Analysis, Llama 4 Maverick scored 49 points, while Scout received a score of 36. This places Maverick above Claude 3.7 Sonnet but below DeepSeek V3 0324. Scout, meanwhile, ranks similarly to GPT-4o mini, outperforming Claude 3.5 Sonnet and Mistral Small 3.1.

Both models performed consistently across general reasoning, programming, and mathematical tasks, with no notable weaknesses in any particular area.

Maverick's architecture is notably efficient: it uses only about half the active parameters of DeepSeek V3 (17 billion versus 37 billion) and roughly 60% of the total parameter count (402 billion compared to 671 billion). Unlike DeepSeek V3, which processes only text, Maverick can also handle images.
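Both Maverick and DeepSeek V3 are mixture-of-experts models, which is why the "active" and "total" parameter counts diverge so sharply: only a small subset of experts runs for each token. The toy layer below is a rough sketch of that idea; the dimensions, expert count, and top-k routing are invented for illustration and do not reflect either model's real architecture.

```python
# Toy mixture-of-experts layer: illustrates why "active" parameters per token
# are far fewer than "total" parameters. All sizes below are made up.
import numpy as np

d_model, d_ff, n_experts, top_k = 64, 256, 16, 2

rng = np.random.default_rng(0)
experts = [
    (rng.standard_normal((d_model, d_ff)), rng.standard_normal((d_ff, d_model)))
    for _ in range(n_experts)
]
router = rng.standard_normal((d_model, n_experts))

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route one token vector to its top-k experts and mix their outputs."""
    scores = x @ router
    chosen = np.argsort(scores)[-top_k:]               # indices of the top-k experts
    weights = np.exp(scores[chosen]) / np.exp(scores[chosen]).sum()
    out = np.zeros_like(x)
    for w, i in zip(weights, chosen):
        w_in, w_out = experts[i]
        out += w * (np.maximum(x @ w_in, 0.0) @ w_out)  # simple ReLU expert MLP
    return out

params_per_expert = d_model * d_ff + d_ff * d_model
total = n_experts * params_per_expert + d_model * n_experts
active = top_k * params_per_expert + d_model * n_experts  # only top-k experts run
print(f"total={total:,}  active per token={active:,}")    # active << total
```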

According to Artificial Analysis, the average cost for Maverick is $0.24 per million input/output tokens, while Scout costs $0.15 per million. These prices undercut even the budget-friendly DeepSeek V3 and are roughly a tenth of what OpenAI's GPT-4o costs.
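As a back-of-the-envelope illustration of what those per-million-token prices mean in practice, here is a minimal sketch; the request sizes are made up, and applying a single blended rate to both input and output tokens is a simplification of how providers actually bill.

```python
# Rough cost estimate from the blended per-million-token prices quoted above.
# Assumes (for illustration only) that one blended rate covers input and output.
PRICE_PER_MTOK = {"llama-4-maverick": 0.24, "llama-4-scout": 0.15}  # USD per 1M tokens

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    return (input_tokens + output_tokens) / 1_000_000 * PRICE_PER_MTOK[model]

# e.g. a 20,000-token prompt with a 1,000-token answer:
print(f"${estimate_cost('llama-4-maverick', 20_000, 1_000):.4f}")  # ≈ $0.0050
print(f"${estimate_cost('llama-4-scout', 20_000, 1_000):.4f}")     # ≈ $0.0032
```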

The launch of Llama 4 has not been without controversy. Several testers have reported significant discrepancies between the model's results on LMArena, the benchmark Meta has actively promoted, and its performance on other platforms, even when using Meta's recommended system prompt.

Meta has confirmed that an "experimental chat version" of Maverick was used for this benchmark, one apparently tuned to appeal to human evaluators through comprehensive, well-structured responses with clear formatting.

Indeed, when LMArena's "Style Control" is enabled, a method that separates content quality from presentation style, Llama 4 drops from second to fifth place. The system attempts to isolate content quality by accounting for factors such as response length and formatting. It is worth noting that other AI developers likely employ similar optimization strategies for this kind of testing.
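The publicly described approach behind this kind of style control is a Bradley-Terry-style regression on pairwise votes with added style covariates such as response length and markdown density. The sketch below is a toy version of that idea; the battle data, feature choices, and model names are invented, and the production implementation differs in detail.

```python
# Toy "style control": fit a logistic regression on pairwise battles where the
# features are model indicators plus style differences (length, markdown use).
# The model-strength coefficients are then less influenced by presentation.
import numpy as np
from sklearn.linear_model import LogisticRegression

models = ["model_a", "model_b", "model_c"]          # hypothetical competitors
idx = {m: i for i, m in enumerate(models)}

# Each battle: (left model, right model, left length, right length,
#               left markdown blocks, right markdown blocks, left won?)
battles = [
    ("model_a", "model_b", 900, 400, 6, 1, 1),
    ("model_a", "model_c", 850, 800, 5, 4, 1),
    ("model_b", "model_c", 420, 780, 1, 3, 0),
    ("model_b", "model_a", 380, 950, 1, 7, 0),
    ("model_c", "model_a", 790, 880, 4, 6, 0),
]

X, y = [], []
for left, right, ll, rl, lm, rm, left_won in battles:
    row = np.zeros(len(models) + 2)
    row[idx[left]] += 1.0                    # +1 for the left model's strength term
    row[idx[right]] -= 1.0                   # -1 for the right model's strength term
    row[len(models)] = (ll - rl) / 1000.0    # style covariate: length difference
    row[len(models) + 1] = lm - rm           # style covariate: markdown difference
    X.append(row)
    y.append(left_won)

clf = LogisticRegression(fit_intercept=False).fit(np.array(X), np.array(y))
strength = clf.coef_[0][: len(models)]       # style-adjusted strength estimates
style_effects = clf.coef_[0][len(models):]   # how much length/formatting sway votes
print(dict(zip(models, strength.round(2))), style_effects.round(2))
```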

The most significant issues arose in tests conducted by Fiction.live, which assess comprehension of complex texts through multi-layered narratives.

Fiction.live claims that its tests better reflect real-world use cases by measuring genuine comprehension rather than mere information retrieval. Models must track shifts in time, make logical predictions based on the information given, and distinguish between what the reader knows and what the characters know.

Llama 4's performance was disappointing in these challenging tests. Maverick showed no improvement over Llama 3.3 70B, while Scout delivered "simply atrocious" results.

The contrast is striking: while Gemini 2.5 Pro maintains 90.6% accuracy at 120,000 tokens, Maverick achieves only 28.1% and Scout just 15.6%.

These findings cast doubt on Meta's claims about long-context processing. Scout, advertised as handling up to 10 million tokens, struggles at just 128,000, and Maverick likewise fails to handle 128,000-token documents consistently, despite its advertised one-million-token context window.

Research increasingly shows that very large context windows deliver fewer benefits than expected, because models struggle to weigh all of the available information equally. Working within smaller contexts of up to roughly 128,000 tokens often proves more effective, and users generally get better results by splitting large documents into chapters rather than processing everything at once.
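As a rough illustration of that chapter-by-chapter approach, the sketch below splits a long document, queries each piece separately, and combines the partial answers in a final short-context pass. The ask_model placeholder, heading regex, and prompt wording are illustrative assumptions rather than any particular API.

```python
# Hedged sketch of the "split into chapters" workflow described above.
# `ask_model` is a placeholder for whatever chat-completion call you use.
import re

def split_into_chapters(text: str) -> list[str]:
    """Split on lines that look like chapter headings; fall back to fixed-size chunks."""
    parts = re.split(r"\n(?=Chapter\s+\d+)", text)
    if len(parts) > 1:
        return parts
    chunk_size = 20_000  # characters, roughly a few thousand tokens
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def ask_model(prompt: str) -> str:
    raise NotImplementedError("plug in your model call here")

def answer_over_long_document(document: str, question: str) -> str:
    # Ask the question against each chapter separately...
    partial_answers = [
        ask_model(f"Context:\n{chapter}\n\nQuestion: {question}")
        for chapter in split_into_chapters(document)
    ]
    # ...then combine the per-chapter answers in a final, short-context pass.
    combined = "\n\n".join(partial_answers)
    return ask_model(
        f"Combine these partial answers into one:\n{combined}\n\nQuestion: {question}"
    )
```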

In response to mixed feedback, Meta’s head of generative artificial intelligence, Ahmad Al-Dahle, explains that the initial inconsistencies reflect temporary implementation challenges rather than inherent limitations of the models themselves.

"Since we launched the models as soon as they were ready, we expect it will take a few days for all public implementations to be optimized," Al-Dahle writes. He categorically denies allegations of training on the test dataset, stating that "this is simply untrue, and we would never do such a thing." "In our view, the inconsistent quality that users observe is tied to the need to stabilize the implementation," Al-Dahle emphasizes, noting that various services are still working on optimizing the deployment of Llama 4.


[Source](https://the-decoder.com/metas-llama-4-models-show-promise-in-standard-tests-but-struggle-with-long-context-tasks/)