Automated software testing is growing in importance in the era of AI-assisted coding. Agentic coding systems accelerate development, but they are also unreliable. Agentic testing — where you ask AI to write tests and check your code against them — helps counter that unreliability. Automatically testing infrastructure components that you intend to build on top of is especially valuable: it results in more stable infrastructure and less downstream debugging.
Software testing methodologies such as Test Driven Development (TDD), a test-intensive approach in which you first write rigorous tests for correctness and only then write code that passes those tests, are an important way to find bugs. But writing tests can be a lot of work. (I personally never adopted TDD for that reason.) Because AI is quite good at writing tests, agentic testing is attracting growing attention.
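To make the TDD loop concrete, here is a minimal sketch using pytest. The `slugify` module and its behavior are hypothetical examples for illustration, not from the original letter: you commit the tests first, watch them fail, then write just enough code to make them pass.

```python
# test_slugify.py -- written first, before any implementation exists.
# In TDD, these tests are expected to fail until slugify() is written.
import pytest
from slugify import slugify  # hypothetical module under test


def test_lowercases_and_replaces_spaces():
    assert slugify("Hello World") == "hello-world"


def test_strips_punctuation():
    assert slugify("Move fast, with stable infrastructure!") == \
        "move-fast-with-stable-infrastructure"


def test_rejects_empty_input():
    with pytest.raises(ValueError):
        slugify("")


# slugify.py -- written second, only to make the tests above pass.
import re

def slugify(text: str) -> str:
    if not text:
        raise ValueError("text must be non-empty")
    # Replace every run of non-alphanumeric characters with a single dash.
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
```

The point of the ordering is that the tests, not the implementation, define what "correct" means; an agent can then iterate on the implementation until the tests pass.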
First, coding agents do misbehave! My teams use them a lot, and we have seen:
- Numerous bugs introduced by coding agents, including subtle infrastructure bugs that take humans weeks to find.
- A security loophole that was introduced into our production system when a coding agent, trying to simplify development, made password resets easier.
- Reward hacking, where a coding agent modified test code to make it easier to pass the tests.
- An agent running "rm *.py" in the working directory, leading to deletion of all of a project's code (which, fortunately, was backed up on GitHub).
In the last example, when pressed, the agent apologized and agreed “that was an incredibly stupid mistake.” This made us feel better, but the damage had already been done!
I love coding agents despite such mistakes and see them making us dramatically more productive. To make them more reliable, I’ve found that prioritizing where to test helps.
I rarely write (or direct an agent to write) extensive tests for front-end code. If there's a bug, hopefully it will be easy to see and will cause little lasting damage. For example, I find generated code's front-end bugs, say in the display of information on a web page, relatively easy to catch: when the front end of a web site looks wrong, you'll see it immediately, and you can tell the agent and have it iterate to fix it. (A more advanced technique: use MCP to let the agent integrate with software like Playwright to automatically take screenshots, so it can autonomously see if something is wrong and debug.)
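As a rough illustration of the screenshot step such an integration relies on, here is a minimal sketch using Playwright's Python API. The URL and output path are placeholder assumptions; in practice the agent would invoke this capability through an MCP server rather than a hand-written script.

```python
# Take a full-page screenshot the agent can inspect for visual bugs.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://localhost:3000")  # placeholder: your dev server
    page.screenshot(path="homepage.png", full_page=True)
    browser.close()
```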
In contrast, back-end bugs are harder to find. I’ve seen subtle infrastructure bugs — for example, one that led to a corrupted database record only in certain corner cases — that took a long time to find. Putting in place rigorous tests for your infrastructure code might help spot these problems earlier and save you many hours of challenging debugging.
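For instance, a handful of corner-case round-trip tests around record storage might catch that kind of corruption much earlier. The sketch below assumes hypothetical save_record and load_record functions; the specific edge cases are illustrative, not from the original letter.

```python
# test_records.py -- round-trip tests for a hypothetical record store.
# save_record() and load_record() stand in for your own infrastructure
# code; the corner cases below are the kind that corrupt data silently.
import pytest
from recordstore import save_record, load_record  # hypothetical module


@pytest.mark.parametrize("record", [
    {"id": 1, "name": ""},                      # empty string field
    {"id": 2, "name": "O'Brien; DROP TABLE"},   # quoting / escaping
    {"id": 3, "name": "名前", "note": "🙂"},     # non-ASCII and emoji
    {"id": 4, "name": "x" * 10_000},            # very long value
    {"id": 5, "name": None},                    # null field
])
def test_round_trip_preserves_record(record):
    save_record(record)
    assert load_record(record["id"]) == record
```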
Bugs in software components that you intend to build on top of lead to downstream bugs that can be hard to find. Further, bugs in a component that's deep in a software stack, one you build multiple abstraction layers on top of, might surface only weeks or months later, long after you've forgotten what you were doing while building that specific component, and they can be really hard to identify and fix. This is why testing components deep in your software stack is especially important. Meta's mantra "Move fast with stable infrastructure" (which replaced "Move fast and break things") still applies today. Agentic testing can help you make sure you have good infrastructure for you and others to build on!
At AI Fund and https://t.co/zpIxRSuky4's recent Buildathon, we held a panel discussion with experts in agentic coding (Michele Catasta, President at Replit; Chao Peng, Principal Research Scientist at Trae; and Paxton Maeder-York, Venture Partnerships at Anthropic; moderated by AI Fund's Eli Chen), where the speakers shared best practices. Testing was one of the topics discussed. That panel was one of my highlights of Buildathon, and you can watch the video on YouTube.