AI Agents Outperform Human Teams in Cybersecurity Competitions: A New Era in Hacking Challenges

A recent series of cybersecurity competitions organized by Palisade Research demonstrated that autonomous AI agents can directly compete against human hackers and, at times, surpass them.

Palisade Research evaluated AI systems in two large-scale Capture The Flag (CTF) tournaments featuring thousands of participants. In these CTF competitions, teams race to find hidden «flags» by solving security challenges ranging from breaking cryptography to identifying software vulnerabilities.
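
To make the format concrete, here is a minimal sketch of the kind of puzzle a CTF poses: a hidden «flag» protected by a weak cipher. The flag text and XOR key below are invented for illustration and are not taken from the Palisade events.

```python
# Toy CTF-style crypto challenge (illustrative only; the flag text and
# XOR key are invented, not taken from the Palisade competitions).

def xor(data: bytes, key: int) -> bytes:
    """XOR every byte of `data` with a single-byte key."""
    return bytes(b ^ key for b in data)

# Challenge setup: players would only receive this ciphertext.
ciphertext = xor(b"flag{single_byte_xor_is_not_encryption}", 0x5A)

# Solver: brute-force all 256 possible keys and keep the candidate that
# matches the expected flag format.
for key in range(256):
    candidate = xor(ciphertext, key)
    if candidate.startswith(b"flag{") and candidate.endswith(b"}"):
        print(f"key=0x{key:02x}  flag={candidate.decode()}")
        break
```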

The goal was to assess how well AI agents performed in comparison to human teams. The results revealed that AI agents significantly outperformed expectations, outpacing the majority of their human competitors.

The first competition, titled «AI vs. Humans,» featured six AI teams competing against approximately 150 human teams. Each team had 48 hours to tackle 20 challenges related to cryptography and reverse engineering.

Four of the AI agents solved 19 of the 20 tasks. The leading AI team ranked in the top five overall, and most of the AI teams finished ahead of the majority of human participants. The challenges could be solved locally, which made them manageable even for AI agents operating under technical constraints.

Even so, the top human teams kept pace with the AI. They credited years of professional CTF experience and a deep familiarity with common problem-solving strategies as key advantages. One participant mentioned having competed on several international-level teams.

In the second competition, called «Cyber Apocalypse,» the stakes were higher. AI agents had to tackle a new set of challenges while competing against nearly 18,000 human players. Many of the 62 tasks required interaction with external machines, a significant hurdle for AI agents, most of which were designed for local execution.
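
The gap between local and remote challenges can be illustrated with a rough sketch, assuming a hypothetical host, port, and payload: a local task can be solved entirely from the files shipped with the challenge, while a remote one requires a live network session with the organizers' infrastructure, which an agent sandboxed for local execution cannot open.

```python
# Rough illustration of local vs. remote challenges. The host, port, and
# payload are hypothetical placeholders, not real competition endpoints.
import socket

def solve_local(path: str) -> str:
    """A local challenge: everything needed ships with the downloaded files."""
    with open(path, "rb") as f:
        return f.read().decode(errors="replace")

def solve_remote(host: str, port: int, payload: bytes) -> bytes:
    """A remote challenge: the flag lives on the organizers' server, so the
    agent must open and drive a live TCP session."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(payload + b"\n")
        return sock.recv(4096)

# Hypothetical usage:
# print(solve_local("challenge/notes.txt"))
# print(solve_remote("challs.example-ctf.org", 31337, b"GETFLAG"))
```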

Four AI agents entered the fray. The top-performing agent, CAI, solved 20 of the 62 challenges and finished in 859th place, putting it within the top 21% of all active competitors. According to Palisade Research, the best AI system outperformed around 90% of the human teams.

The study also analyzed how difficult the tasks solved by the AI actually were, using the time the best human teams needed to complete the same tasks as a benchmark. On challenges that took even top human teams about 78 minutes to solve, the AI achieved a 50% success rate. In other words, the AI handled challenges that posed real difficulty even for experts.
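
One way to read that benchmark is to bucket challenges by the fastest human solve time and look at the AI success rate per bucket. The sketch below uses invented placeholder data, not Palisade's measurements, purely to show the shape of the analysis.

```python
# Sketch of the difficulty analysis: bucket challenges by the fastest
# human team's solve time and report the AI success rate per bucket.
# All numbers below are invented placeholders, not Palisade's data.

challenges = [
    # (fastest human solve time in minutes, solved by any AI agent?)
    (4, True), (9, True), (15, True), (25, True), (40, True),
    (60, False), (70, False), (85, True), (110, False), (180, False),
]

buckets = [(0, 30), (30, 90), (90, 10_000)]  # minutes of top-human effort

for lo, hi in buckets:
    solved_flags = [solved for minutes, solved in challenges if lo <= minutes < hi]
    rate = sum(solved_flags) / len(solved_flags) if solved_flags else 0.0
    print(f"top humans needed {lo}-{hi} min: AI solved {rate:.0%} "
          f"of {len(solved_flags)} challenges")
```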

Previous assessments, such as CyberSecEval 2 and the InterCode-CTF test, had rated AI cybersecurity skills much lower, the Palisade researchers noted. In both cases, later teams raised attack success rates considerably by adjusting parameter settings; Google's Project Naptime, for instance, achieved 100% success on memory attacks with the right configuration.

According to Petrov and Volkov, this highlights what they call an «evaluation gap»: the real capabilities of AI are often underestimated because of limited assessment methods, so traditional tests may not capture the full potential of AI systems. Palisade Research argues that crowdsourced competitions should complement standard evaluations, since events like «AI vs. Humans» yield more meaningful and policy-relevant data than conventional tests.

Additionally, I would like to recommend [BotHub](https://bothub.chat/?utm_source=contentmarketing&utm_medium=habr&utm_campaign=news&utm_content=AI_AGENTS_EXCEL_HUMAN_TEAMS_IN_HACKING_COMPETITIONS) – a platform where you can test all popular models without restrictions. No VPN is needed to access the service, and Russian cards can be used. [Follow this link](https://bothub.chat/?invitedBy=m_aGCkuyTgqllHCK0dUc7) to receive 100,000 free tokens for your first tasks and start working right away!

[Source](https://the-decoder.com/ai-agents-outperform-human-teams-in-hacking-competitions/)