The AI Revolution Moves to the Terminal: How New Tools Are Automating Software Development Beyond Code
In recent years, software developers have grown accustomed to AI-powered tools embedded in their code editors. Products like Cursor, Windsurf, and GitHub Copilot have become the de facto standards in this domain. Yet a subtle but significant shift is now underway: AI models are increasingly working not with the code directly, but with the operating system’s terminal. This development could radically reshape automated software development.
The terminal, familiar to many as a cinematic relic of 1990s hacker culture, remains an exceptionally powerful interface for system control despite its seemingly antiquated appearance. While coding assistants excel at generating and debugging source code, it is through terminal commands that this code is turned into working software: dependencies are installed, builds are compiled, and projects are deployed and debugged in live environments.
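To make that sequence concrete, here is a minimal Python sketch of a tool driving install, build, and run steps through the command line. It is an illustration only, not how any of the products named in this article work: the `run_steps` helper and the three stand-in commands are hypothetical, and a real project would substitute its own package manager and build commands.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each named command in order and stop at the first failure."""
    results = []
    for name, argv in steps:
        # capture_output collects stdout/stderr; text=True decodes them to str
        proc = subprocess.run(argv, capture_output=True, text=True)
        results.append((name, proc.returncode, proc.stdout.strip()))
        if proc.returncode != 0:
            break  # later steps depend on earlier ones, so do not continue
    return results


# Hypothetical stand-ins for real install/build/run commands.
steps = [
    ("install", [sys.executable, "-c", "print('dependencies installed')"]),
    ("build", [sys.executable, "-c", "print('build complete')"]),
    ("run", [sys.executable, "-c", "print('service started')"]),
]

results = run_steps(steps)
for name, code, output in results:
    print(f"{name}: exit {code}: {output}")
```

Stopping at the first non-zero exit code mirrors how a developer would work by hand: there is little point in building if the dependencies failed to install.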
Since the beginning of the year, several leading AI labs (Anthropic, Google DeepMind, and OpenAI) have launched their own command-line tools: Claude Code, Gemini CLI, and Codex CLI, respectively. Though these tools carry the same brand names as the labs’ existing models, they operate in a fundamentally different way. Rather than engaging with code in isolation, they interact with the computer as a whole system. This shift calls for a new class of tasks, methodologies, and benchmarks.
According to Mike Merrill, co-creator of the Terminal-Bench benchmark, as much as 95% of future interactions between AI and computers may take place through the terminal. His team developed the benchmark to assess how effectively AI agents perform tasks that go beyond traditional code editing. Examples include compiling the Linux kernel from source, reconstructing a compression algorithm from a decompression routine, and configuring a Git server entirely without prompts.
The resurgence of interest in the terminal is also fueled by mounting problems with conventional AI code editors. Windsurf, for instance, has been caught up in corporate upheaval: part of the team left for Google, and the company itself was acquired by Cognition, casting doubt on the product's future. Meanwhile, a METR study of developers using Cursor Pro found that, despite the tool's advertised productivity gains, they completed their tasks nearly 20% more slowly. Developers, it seems, had overestimated how much it helped.
The contrast between tool generations is especially stark in how they are tested. Code editors like Cursor are optimized for GitHub-style tasks: locate a bug and fix it, the model exemplified by the widely used SWE-Bench. Terminal agents, on the other hand, engage with the entire system. They must be able to start processes, configure environments, and interact with files, network services, and hardware components.
In the more complex challenges posed by Terminal-Bench, agents are often given no instructions: they must work out the objective and find their way to a solution independently. Even today’s most sophisticated models solve only about half of the test cases. Nevertheless, tools like Warp already demonstrate AI’s capacity to handle routine tasks autonomously: setting up environments, resolving dependencies, launching projects and, when unable to proceed, explaining why.
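The last of those behaviors, explaining why a step could not be completed, can be pictured with a short Python sketch. This is an assumption-laden illustration rather than a description of Warp or any other tool: the `attempt` helper, its 30-second default timeout, and the `libfoo` error message are all hypothetical.

```python
import subprocess
import sys


def attempt(argv, timeout=30):
    """Run a command; return (True, stdout) or (False, a readable reason)."""
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout
        )
    except FileNotFoundError:
        return False, f"command not found: {argv[0]}"
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout}s"
    if proc.returncode != 0:
        # The last line of stderr is usually the most specific error message
        lines = proc.stderr.strip().splitlines()
        reason = lines[-1] if lines else "no error output"
        return False, f"exit code {proc.returncode}: {reason}"
    return True, proc.stdout.strip()


# Hypothetical failing step: the child process exits with an error message.
ok, detail = attempt(
    [sys.executable, "-c", "import sys; sys.exit('missing dependency: libfoo')"]
)
print(ok, detail)
```

Returning the reason as data rather than raising an exception lets the caller decide what to do next: retry, try a different command, or report the problem to the user.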
This evolving mode of interaction draws AI ever closer to the role of a true developer’s assistant, one that not only writes lines of code but also looks after the software environment itself. And all of this unfolds within the familiar confines of the terminal, which has unexpectedly become the main stage for a new era of AI-driven innovation.