Beyond Chatbots: Microsoft’s New Tool Measures if AI Can Think Like a Cyber Analyst
Microsoft has unveiled a new tool designed to assess the effectiveness of artificial intelligence in cybersecurity. The platform, named ExCyTIn-Bench, recreates conditions closely resembling those of a threat monitoring center, enabling evaluations of how accurately and consistently AI models investigate real-world incidents. It is Microsoft’s first open benchmark that measures not merely a model’s knowledge, but its capacity to analyze, hypothesize, and articulate reasoning based on vast datasets of security logs.
ExCyTIn-Bench draws on data from 57 telemetry tables taken from Microsoft Sentinel and related services, mirroring the scale, noise, and complexity of the data SOC analysts handle every day. Rather than relying on conventional question-and-answer tests, the system simulates multi-stage attacks in which the AI agent must construct queries, correlate data sources, and uncover indicators of compromise. This approach rewards the depth of the reasoning and the completeness of the investigation rather than an answer that happens to be correct.
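To make the idea of "constructing queries and correlating data sources" concrete, here is a minimal sketch of the kind of step an agent might take. Everything in it is an assumption made for illustration: the two tables (`signin_logs`, `file_events`), their columns, and the sample rows are hypothetical, and an in-memory SQLite database stands in for the benchmark's real log store, whose schema and query language the article does not describe.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory log store with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE signin_logs (ts TEXT, account TEXT, ip TEXT, result TEXT);
        CREATE TABLE file_events (ts TEXT, account TEXT, file_name TEXT, action TEXT);

        INSERT INTO signin_logs VALUES
            ('2024-01-01T09:00:00', 'bob',   '198.51.100.4', 'success'),
            ('2024-01-01T09:30:00', 'alice', '203.0.113.7',  'failure'),
            ('2024-01-01T10:00:00', 'alice', '203.0.113.7',  'success');

        INSERT INTO file_events VALUES
            ('2024-01-01T08:00:00', 'alice', 'report.docx', 'download'),
            ('2024-01-01T10:05:00', 'alice', 'payload.exe', 'download'),
            ('2024-01-01T10:10:00', 'bob',   'notes.txt',   'download');
        """
    )
    return conn


def correlate(conn, suspicious_ip):
    """Join sign-ins from one IP with downloads the same account made afterwards."""
    query = """
        SELECT s.account, s.ip, f.file_name
        FROM signin_logs AS s
        JOIN file_events AS f
          ON f.account = s.account
         AND f.ts >= s.ts              -- only activity after the sign-in
        WHERE s.ip = ?
          AND s.result = 'success'
          AND f.action = 'download'
        ORDER BY f.ts
    """
    return conn.execute(query, (suspicious_ip,)).fetchall()


if __name__ == "__main__":
    for row in correlate(build_demo_db(), "203.0.113.7"):
        print(row)
```

The point of the sketch is the correlation itself: neither table alone ties the suspicious address to the downloaded file, so the agent has to join them on a shared entity and respect the order of events.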
For enterprise cybersecurity teams, the tool serves as a new benchmark when selecting AI-driven solutions. It allows decision-makers to understand how effectively a model can conduct comprehensive investigations, adapt to evolving threats, and justify its conclusions. Microsoft already employs ExCyTIn-Bench internally to evaluate AI functionalities across Security Copilot, Sentinel, and Defender. The resulting insights help developers identify weaknesses in detection logic and optimize computational efficiency.
Unlike earlier open methodologies such as CyberSOCEval and CTIBench, the new system is built upon incident graphs, or alert-entity graphs. Within these structures, nodes represent events and entities—such as suspicious downloads or user accounts—while edges define their interrelations. From these graphs, ExCyTIn-Bench generates explainable question–answer pairs, serving as standards for evaluating reasoning quality. As a result, the benchmark measures not only final outcomes but the entire analytical process: planning, data navigation, tool selection, and evidence synthesis.
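The graph structure described above can be sketched in a few lines. This is an illustrative toy, not ExCyTIn-Bench's actual code: the node identifiers, labels, and the `make_qa` helper are hypothetical, and a breadth-first search is used here simply as one plausible way to recover the chain of evidence that links a starting alert to an answer entity.

```python
from collections import deque


def build_example_graph():
    """Return a toy alert-entity graph: labels per node and undirected edges."""
    labels = {
        "A1": "Suspicious download",   # alert
        "E1": "user: alice",           # entity
        "A2": "Unusual sign-in",       # alert
        "E2": "ip: 203.0.113.7",       # entity
    }
    edges = {node_id: set() for node_id in labels}
    for a, b in [("A1", "E1"), ("E1", "A2"), ("A2", "E2")]:
        edges[a].add(b)
        edges[b].add(a)
    return {"labels": labels, "edges": edges}


def shortest_path(edges, start, goal):
    """Breadth-first search; returns the list of node ids or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(edges[path[-1]]):  # sorted for deterministic output
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def make_qa(graph, start_id, answer_id):
    """Build a question-answer pair whose explanation is the connecting path."""
    path = shortest_path(graph["edges"], start_id, answer_id)
    if path is None:
        raise ValueError("start and answer nodes are not connected")
    labels = graph["labels"]
    return {
        "question": f"Starting from the alert '{labels[start_id]}', "
                    "which related entity does the investigation lead to?",
        "answer": labels[answer_id],
        "evidence_path": [labels[node_id] for node_id in path],
    }


if __name__ == "__main__":
    print(make_qa(build_example_graph(), "A1", "E2"))
```

Because the answer comes with the path that produced it, a grader can check not only whether a model reached the right entity but also whether it passed through the intermediate alerts and entities along the way.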
The benchmark also introduces a stepwise reward system: each model action is scored according to intermediate progress rather than a simple “right or wrong” dichotomy. This transparency illuminates which reasoning steps lead to errors and which enhance overall precision. Organizations thus gain more than a mere success rate—they receive a detailed understanding of how the model reasons, ensuring its conclusions are verifiable and compliant with trust and governance standards in AI operations.
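A stepwise reward of this kind might look like the following sketch. The article does not give the benchmark's scoring formula, so the scheme here is an assumption chosen for clarity: intermediate pieces of evidence share part of the score, the final answer carries the rest, and the `answer_weight` parameter and function name are hypothetical.

```python
def stepwise_reward(evidence_path, agent_steps, final_answer, answer_weight=0.5):
    """Score an investigation step by step.

    evidence_path: ordered labels leading to the answer (last item is the answer).
    agent_steps:   one set of labels per agent action, listing what it surfaced.
    Returns (total_score, per_step_scores) with total_score between 0 and 1.
    """
    milestones = evidence_path[:-1]          # intermediate evidence only
    per_milestone = (1.0 - answer_weight) / len(milestones) if milestones else 0.0

    found = set()
    per_step = []
    for surfaced in agent_steps:
        # Credit only evidence that is on the path and not already rewarded.
        new_hits = (set(surfaced) & set(milestones)) - found
        found |= new_hits
        per_step.append(per_milestone * len(new_hits))

    answer_score = answer_weight if final_answer == evidence_path[-1] else 0.0
    return sum(per_step) + answer_score, per_step


if __name__ == "__main__":
    path = ["Suspicious download", "user: alice", "Unusual sign-in", "ip: 203.0.113.7"]
    steps = [
        {"Suspicious download"},              # useful first query
        {"unrelated process event"},          # dead end, earns nothing
        {"user: alice", "Unusual sign-in"},   # one query surfaces two milestones
    ]
    total, per_step = stepwise_reward(path, steps, final_answer="ip: 198.51.100.1")
    print(total, per_step)
```

In this example the agent names the wrong address at the end yet still earns partial credit for the evidence it uncovered, and the per-step scores show exactly which action contributed nothing.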
Developed in an open format, ExCyTIn-Bench invites researchers and vendors worldwide to conduct comparisons and share findings. Microsoft plans to extend the platform with custom test generation tailored to specific threat profiles relevant to individual client infrastructures. This will enable organizations to build bespoke investigation scenarios and evaluate models against data most representative of their own environments.
Initial results suggest that modern language models are becoming more adept at this kind of investigation. In benchmark testing, GPT-5 with enhanced reasoning mode achieved the highest score, 56.2%, surpassing all prior generations. Notably, compact versions such as GPT-5-mini, which rely on chain-of-thought reasoning, nearly matched the accuracy of their larger counterparts while remaining more resource-efficient. The study also found that reducing reasoning depth decreased performance by nearly 19%, underscoring the critical role of sequential analysis in incident investigation.
According to Microsoft, open-source models are gradually closing the gap with proprietary counterparts, making automated cybersecurity more accessible. Developers and practitioners can freely download and test ExCyTIn-Bench via its GitHub repository, as well as join the community to exchange results and refine the toolset. The platform is rapidly emerging as a new standard for evaluating whether AI can truly think like a SOC analyst and withstand the complexity of real-world attacks.