JetBrains has a piece out on AI adoption in DevOps, and the survey results are in: AI has colonized nearly every stage of software development except the one that actually ships code. More than 90% of developers use AI in day-to-day work, but 73% of organizations don't use AI in their CI/CD pipelines at all, and among teams that do, most don't delegate. The use is advisory, not autonomous. The top blockers teams cite are "unclear use cases" (60%), lack of trust (36%), and data privacy (33%).
The article's best idea is the reframe underneath those numbers: CI/CD isn't really automation. It's an evidence system. Its job is to produce signals trustworthy enough that a team will bet a release on them. Builds, tests, scans, deploys: each one answers the same question, is this safe to ship?
Note: JetBrains publishes TeamCity, a CI/CD product, so readers can reasonably presume a bias toward "the pipeline needs a trust layer," which is to say, toward the value proposition of a better pipeline. This is not a disinterested source. JetBrains doesn't operate in bad faith, but survey interpretation is always downstream of who's doing the interpreting, and there are at least two things this piece either skims past or misses outright.
It's hard to A/B test a pipeline
The first missing piece is the feedback loop. In the IDE, AI either helps or it doesn't, and you know in seconds. (Well, hopefully.) Wrong autocomplete, delete. Right debug suggestion, move on. The cost of a bad suggestion is the time it took to read and reject.
CI/CD doesn't work that way. If an AI-tuned test selector drops a flaky integration test that would have caught a regression, you don't find out in three seconds. You find out in six weeks, from a customer, after a deploy that looked green. The signal loop is diffuse, lagged, and statistically thin. Most of the time everything's fine, which makes it maddeningly hard to tell whether AI is helping, hurting, or invisible.
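To put a number on "statistically thin," here's a toy simulation. Every rate in it is a made-up assumption for illustration, not survey data. The point it makes: even at plausible scale, the pipeline produces an evaluable event (an escaped regression) only once every few weeks, and each event carries almost no information about the selector's quality.

```python
import random

# Toy model of the CI/CD feedback loop. Every rate here is an illustrative
# assumption, not a number from the JetBrains survey.
REGRESSION_RATE = 0.02    # fraction of deploys carrying a real regression
ESCAPE_RATE = 0.20        # fraction of those the AI test selector misses
DEPLOYS_PER_WEEK = 50

def weeks_until_first_escape(seed: int) -> int:
    """Weeks until the selector's first miss surfaces as a green-but-broken deploy."""
    rng = random.Random(seed)
    week = 0
    while True:
        week += 1
        for _ in range(DEPLOYS_PER_WEEK):
            if rng.random() < REGRESSION_RATE and rng.random() < ESCAPE_RATE:
                return week

trials = sorted(weeks_until_first_escape(s) for s in range(1000))
print(f"median weeks to first escaped regression: {trials[len(trials) // 2]}")
# Expected escapes per deploy: 0.02 * 0.20 = 0.4%. One escape every few
# weeks is enough to hurt, but far too sparse to distinguish a 20% escape
# rate from a 10% one without months of data.
```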
That is what the 73% actually measures. JetBrains reads it as teams being appropriately cautious about risk. A harder reading: teams have no falsifiable way to tell whether AI in CI/CD is helping, and rational operators don't adopt what they can't evaluate. It isn't "unclear use cases." It's unclear evidence. This is the DevOps group buying into the hype more carefully, because caution is their job.
CI/CD is the test
The second missing piece is structural. Few teams have staging pipelines that faithfully replicate production, because replicating production is itself a production problem. The production pipeline is the validator of record, and there is nothing above it. When AI proposes a pipeline change (a new stage, a different test partition, a reordered dependency), there is no outer check. You ship it or you don't.
This isn't unique to CI/CD. You see the same pattern in ordinary code: we rarely test our tests. We just run them. A passing unit test is assumed to be correct because it passed; a failing one is assumed to be meaningful because it failed. Test code gets reviewed, sometimes, but almost never tested in anything like the way the code it validates gets tested. CI/CD is the extreme case of that same inversion: the most expensive test, the last test, tested the least.
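The one established technique for testing tests, mutation testing, makes the inversion vivid: you break the code on purpose and check that the test notices. A toy sketch of the idea (hand-rolled here; real tools like mutmut or PIT generate the mutants automatically):

```python
# Toy illustration of "testing a test": deliberately break the code under
# test and check that the test actually fails. This is the core idea of
# mutation testing; everything here is a hand-made example.

def add(a, b):
    return a + b

def broken_add(a, b):      # a hand-made "mutant" of add
    return a - b

def test_add(fn) -> bool:
    """The test under scrutiny. Returns True if it passes."""
    return fn(2, 2) == 4   # weak: a mutant returning a * b also passes

# A test earns trust only if it passes on the real code AND fails on mutants.
assert test_add(add)                   # passes, as expected
assert test_add(broken_add) is False   # this mutant is caught...
# ...but a `return a * b` mutant would slip through, since 2 * 2 == 4.
# That blind spot is exactly what "just running the tests" never reveals.
```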
This is where the "trust layer" conclusion thins out. A trust layer for AI pipeline changes would itself need something to validate it, and that something would need the same, all the way up: Quis custodiet ipsos custodes? Who watches the watchmen?
Either you accept that CI/CD is terminal and treat AI-in-pipeline with the paranoia reserved for AI-in-kernel code, or you build a second validation tier, which nobody has and nobody wants to fund. The honest position is the first one, and it's why seasoned DevOps people tend to flinch at agentic framings in this part of the stack.
The maturity model points the wrong way
JetBrains offers a four-stage model: no AI, then AI that explains failures, then AI that proposes changes, then agentic workflows. It reads like a ladder, with the implication that teams will climb it as trust develops.
Invert it and it reads better. Stage 2 (AI reading logs, summarizing failures, pointing at likely root causes) is the ceiling for measurable adoption, not a waypoint. Triage has a real feedback loop: an engineer reads the summary, checks the log, and knows relatively quickly whether the suggestion is useful. Time-to-fix is something you can measure, with or without AI, and compare directly. Stages 3 and especially 4 are where the epistemology breaks, because the things AI is doing at those stages aren't the things CI/CD can evaluate.
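That measurability is concrete: failure triage bottoms out in timestamped human actions, so you can compare time-to-fix with and without the AI summary. A minimal sketch, assuming hypothetical build records (every field name and number here is made up; any CI system's metadata would do):

```python
from statistics import median

# Hypothetical failure records; in practice these come from your CI
# system's API. "ai_triaged" marks failures where an AI summary was shown
# to the engineer. All fields and values are illustrative.
failures = [
    {"ai_triaged": True,  "minutes_to_fix": 18},
    {"ai_triaged": True,  "minutes_to_fix": 35},
    {"ai_triaged": False, "minutes_to_fix": 52},
    {"ai_triaged": False, "minutes_to_fix": 41},
    # ... hundreds more in a real dataset
]

def median_ttf(records, ai: bool) -> float:
    return median(r["minutes_to_fix"] for r in records if r["ai_triaged"] is ai)

print("with AI triage:   ", median_ttf(failures, True), "min")
print("without AI triage:", median_ttf(failures, False), "min")
# This comparison works because each failure resolves in minutes or hours.
# A stage-3 change like AI test selection has no equivalent per-event
# outcome to timestamp, which is the asymmetry the ladder glosses over.
```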
The survey itself quietly confirms this. Among the minority using AI in CI/CD at all, most aren't delegating. The use is advisory.[1] So the real split in the data isn't AI vs. no-AI. It's passive AI (reads logs for me, explains a failure, suggests a next step) versus active AI (selects tests, rewrites config, triggers reruns). Passive AI has a feedback loop and is working. Active AI doesn't have the feedback loop, isn't working in measurable ways, and the survey numbers are reporting exactly that.
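The passive/active split is also a line you can draw in code. The same model output is harmless as a comment and hazardous as an action; the only difference is who executes it. A sketch under assumed names (ai_suggest_fix and execute are hypothetical stand-ins, not any real API):

```python
from dataclasses import dataclass

@dataclass
class Suggestion:
    summary: str   # e.g. "flaky: test_checkout times out under parallelism"
    action: str    # e.g. "rerun test_checkout in isolation"

def ai_suggest_fix(build_log: str) -> Suggestion:
    """Hypothetical stand-in for a model call."""
    return Suggestion(summary="...", action="...")

def execute(action: str) -> None:
    """Hypothetical executor: rerun tests, edit config, trigger a deploy."""
    print(f"[pipeline] executing: {action}")

def on_failure_passive(build_log: str) -> None:
    # Passive AI: post the suggestion for a human. The loop closes in
    # minutes, when the engineer accepts or ignores it.
    s = ai_suggest_fix(build_log)
    print(f"[ai-triage] {s.summary} -> suggested: {s.action}")

def on_failure_active(build_log: str) -> None:
    # Active AI: act on the suggestion directly. If the action masks a real
    # failure, nothing downstream notices, because the pipeline was the check.
    s = ai_suggest_fix(build_log)
    execute(s.action)
```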
JetBrains has real data here, and a real insight in the evidence-system framing. The conclusion would be stronger if it stopped one step earlier: AI belongs in CI/CD wherever it can be evaluated, and the evaluability problem is the whole problem.
[1] This might be a good model for coders to follow as well: instead of letting the AI just do it and hoping, maybe the AI suggests paths and the human evaluates the path before following it.