AI Didn't Break the Rules. It Broke the Way We Check Them

AI didn't break application security, or code quality, or compliance. It broke the belief that any of them can be a single checkpoint.

That belief has been load-bearing for a long time. You build the thing, and then - before release, once a year, at the end of the quarter - you run the scan, book the pen test, sit the audit. The check was a gate you walked through on the way out the door. It worked because the door didn't open very often.

With AI writing a large and growing share of the code, the door never closes. And to explain why that changes everything - and why it's a structural problem, not a security problem - I need to start with what's actually happening in engineering teams right now, including Waivern's.

Part 1: Is the AI-driven development lifecycle real?

The adoption numbers are no longer in dispute.

Stack Overflow's 2025 survey of more than 49,000 developers put AI tool usage at 84%, up from 76% in 2024. DX's Q4 2025 impact report, drawn from a sample of over 135,000 developers, found 91% adoption and that 22% of merged code is now AI-authored - meaning generated by AI and merged without major human rewrites. Other surveys put the "AI-assisted" figure closer to 42% of committed code.

It has happened. It's no longer a choice.

Now look at the other side of the same coin. As adoption has climbed, trust has fallen. Stack Overflow's positive sentiment toward AI tools dropped from over 70% in 2023 to 40% in 2024 to just 29% in 2025. The single biggest frustration, cited by 66% of developers, is AI solutions that are "almost right, but not quite," and 45% say debugging AI-generated code takes longer than writing it by hand. The steeper the adoption curve, the steeper the trust decline.

I've been using Claude Code daily since it was in closed beta, and that second curve matches my experience far better than the first. So here is what I'd actually tell you, from the chair of someone who runs an AI-native engineering shop.

AI accelerates writing code. It does not accelerate the work around writing good code - it amplifies it. It demands more architecture and planning up front, more hand-holding, more genuine co-working, and a tremendous amount more reviewing. The bottleneck doesn't disappear. It moves. It moves to review.

"Let the AI do everything and just review the PRs" does not work. It sounds like the dream workflow, but it's a trap. When you only see the final diff, you miss the errors in judgement the AI made along the way - the decision to reach for a heavyweight library, the pattern it chose, the edge case it quietly skipped or decided to add even when it was unnecessary. The end result usually looks fine. It's just not optimal, and "looks fine" is exactly the failure mode that survives a quick review. It's the AI-era version of an engineer who keeps bolting on another if-else instead of refactoring, while the reviewer hits approve because if-else is easy - not because the code is refactored properly.

The "five parallel agents" thing is a productivity fantasy. Some people will tell you they've got five-plus Claude Code sessions running at once. I would not advise it. The constraint was never how fast you can generate code; it's how much you can actually understand and stand behind. You are the gate. Five streams of code you haven't internalised are five streams of liability.

Enter Key

If you hit Enter without reading the edit, you are inviting trouble. Blind trust in AI is the surest way to accumulate tech debt faster than you can pay it down. More AI code, more debt, and debt on debt compounds - the maths really is that simple.

And the debt isn't only architectural. It's security- and compliance-shaped, in ways that matter for this argument:

  • More code, more dependencies, more surface area - much of it pulled in unreviewed. AI loves a popular library and a well-worn pattern, because that's what it saw most of on the internet. That often means heavier dependencies than you needed and code that's been copy-pasted across a thousand repos, duplication and all. Remember the days when GitHub Copilot was always eager to autocomplete your code with the fine examples from "Fluent Python"?
  • Configuration drift. AI is genuinely excellent at producing the standard, cookbook-style config. The problem is that the standard config is almost never the one your system actually needs, because the AI doesn't know your context, your threat model, or your non-functional requirements.
  • You can't review what you can't read. If you're not able to scrutinise the new libraries, the new configs, the new test setups the AI introduced, you are heading somewhere you didn't choose to go.
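To make the configuration point concrete, here is a minimal sketch of checking a generated config against the settings your own context requires. The baseline, the setting names, and the values are all invented for illustration; they are assumptions, not taken from any particular framework or standard.

```python
# Hypothetical organisational baseline: settings your threat model requires.
REQUIRED = {
    "tls_min_version": "1.3",
    "session_timeout_minutes": 15,
    "debug": False,
}


def deviations(config: dict, required: dict = REQUIRED) -> dict:
    """Return {setting: (actual, expected)} for every setting that misses the baseline."""
    return {
        key: (config.get(key), expected)
        for key, expected in required.items()
        if config.get(key) != expected
    }


# A plausible "cookbook" config: perfectly standard, wrong for this system.
generated = {
    "tls_min_version": "1.2",
    "session_timeout_minutes": 15,
    "debug": True,
    "workers": 4,
}

drift = deviations(generated)
```

Nothing in the generated config is a bug in isolation. It only becomes a finding once it is compared with a baseline the AI never saw.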

If you want the cautionary tale in its most vivid form: in July 2025, during a "vibe coding" experiment, Replit's AI agent deleted a live production database during an explicit code freeze, then misrepresented what it had done. The agent later admitted it had run unauthorised commands and violated explicit instructions not to proceed without human approval. The wiped database held records on more than 1,200 executives and roughly 1,200 companies. The lesson isn't "AI is dangerous." It's that an autonomous system optimising for task completion will treat an obstacle - a schema conflict, a locked door - as something to remove rather than something to ask about. The hesitation a human engineer feels at that point is not a lack of courage. It is judgement doing its job.

(Whether AI should be doing the design - whether it's "there yet" as an architect - is a real debate, but a separate one. I'll leave it for another post.)

So: is the AI-driven development lifecycle (ADLC) real? Yes. Code is being produced faster, by more people, with less human attention per line than at any point in the history of the field. That's the part everyone agrees on.

The part that gets less airtime is the consequence.

Part 2: What actually breaks

The traditional security and compliance model is a set of point-in-time gates. The pre-release penetration test. The annual ISO 27001 audit. The SAST scan that runs when someone remembers to run it. Each one is a photo of the system taken at a single moment, on the assumption that the system doesn't change much between snaps.

When 22% of merged code is AI-authored and a single engineer can generate a week's worth of changes in an afternoon, the system is no longer a photo album. It is an action movie now.

The annual audit certifies a system that no longer exists by the time the certificate prints. The pre-release pen test covers a release that's three releases stale by the time the report lands. The checkpoint can't keep pace with the rate of change, and the rate of change is the one number that's only going up.

The obvious objection is: we solved this years ago - it's called shift-left. DevSecOps. SAST and DAST in the pipeline. And shift-left was the right instinct. But it shifted the tooling left without shifting the judgement left. A SAST scanner in your CI pipeline will tell you a function looks like SQL injection. It will not tell you whether that finding matters given your data classification, whether it maps to a control you're actually obligated to operate, or whether the config the AI just generated violates a policy your ISMS commits you to. It produces findings. It does not produce decisions, and it has no idea what your policy is.

That's the gap. We moved the scanners earlier in the lifecycle and called it done, but we left the two hard parts - "is this acceptable given our policy and context?" and "which control does this satisfy?" - exactly where they always were: in a human's head, consulted at the gate, once. AI-accelerated development just removed the gate.
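The difference between a finding and a decision can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical data-classification table and a hypothetical control mapping; the paths, control ID, and decision rules are invented, and a real policy would be richer than three branches.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Decision(Enum):
    BLOCK = "block"
    ESCALATE = "escalate"  # route to the human who owns the call
    ACCEPT = "accept"


@dataclass(frozen=True)
class Finding:
    rule: str      # what a scanner reports, e.g. "sql-injection"
    path: str
    severity: str  # "low", "medium" or "high"


# Hypothetical organisational context. A scanner has none of this.
DATA_CLASSIFICATION = {"services/billing/": "restricted", "tools/scratch/": "public"}
CONTROL_MAP = {"sql-injection": "CTRL-SECURE-CODING"}  # assumed control ID


def classify(path: str) -> str:
    for prefix, label in DATA_CLASSIFICATION.items():
        if path.startswith(prefix):
            return label
    return "unclassified"


def decide(finding: Finding) -> Tuple[Decision, str]:
    """Turn a raw finding into a decision using policy context."""
    control = CONTROL_MAP.get(finding.rule)
    data_class = classify(finding.path)
    if control is None or data_class == "unclassified":
        return Decision.ESCALATE, "no policy context for this finding; a human decides"
    if data_class == "restricted":
        return Decision.BLOCK, f"{control} is breached on restricted data"
    if finding.severity == "high":
        return Decision.ESCALATE, f"{control} applies; high severity on {data_class} data"
    return Decision.ACCEPT, f"{control} applies; {finding.severity} severity on {data_class} data"
```

The same scanner rule yields three different outcomes depending on where it fires. The unmapped cases go to a person rather than being silently passed.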

The fix is not a better gate. It's to stop treating the check as a gate at all - and to make it run at the speed the code is now written.

Part 3: And then AI shows up in the product itself

Everything above is about AI in the toolchain - AI helping you build. That's hard enough, but it's still a problem of degree: the same checks we always ran, now needing to run faster. There's a second problem, and it's different in kind. Conflating the two is a mistake.

Your clients aren't just building with AI; increasingly they're shipping it. And the moment AI is in the product, the thing being secured stops being a deterministic system you can reason about. It's a probabilistic model - one that behaves differently on inputs you didn't test, that can be prompt-injected, that can leak training data, that can be confidently wrong in a way that creates real liability. Traditional AppSec has no vocabulary for this, because AppSec was built to secure systems that do the same thing twice.

It gets harder with agents. An agent doesn't just answer; it plans, calls tools, and takes actions - it sends the email, writes to the database, deploys the code. And here's the structural killer: a catalogue of k tools composed across n steps produces up to kⁿ distinct execution paths. You cannot exhaustively test that space before release. The combinatorial explosion makes pre-deployment behavioural characterisation infeasible beyond toy configurations.
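The arithmetic is easy to check for yourself. The sketch below assumes the simplest model, in which any tool may follow any other at every step, so kⁿ is an upper bound; the tool names are placeholders.

```python
from itertools import product
from typing import List, Sequence, Tuple


def execution_paths(k: int, n: int) -> int:
    """Upper bound on distinct tool sequences: k choices at each of n steps."""
    return k ** n


def enumerate_paths(tools: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Explicitly list every sequence. Only feasible for toy sizes."""
    return list(product(tools, repeat=n))


# Three placeholder tools over two steps: small enough to enumerate by hand.
toy_paths = enumerate_paths(["search", "send_email", "db_write"], 2)
```

Three tools over two steps gives nine sequences, which you could test exhaustively. Ten tools over five steps gives 100,000, and twenty tools over ten steps gives about 10¹³, which you could not.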

This is also where I watch teams reach for the wrong control. The instinct is to write a system prompt - "never delete files," "always ask before sending" - and call it a safeguard. It isn't. A system prompt is a natural-language request the model may or may not honour under prompt injection, jailbreak, or simple emergent behaviour. It is not a deterministic boundary. The Replit agent had instructions too. The only real controls are enforced below the model: least-privilege at the API layer, permissions scoped per action, irreversible actions gated by design rather than by politeness. If your access control lives inside a prompt, you don't have access control.
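Here is a minimal sketch of what "enforced below the model" can look like: a gateway that sits between the agent and its tools and does not consult the prompt at all. The class, exception, and tool names are hypothetical, and a production version would also need authentication, audit logging, and a real approval workflow.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, Optional


class PermissionDenied(Exception):
    """The action is outside the agent's scoped permissions."""


class ApprovalRequired(Exception):
    """The action is irreversible and no human has approved it."""


@dataclass
class ToolGateway:
    allowed: FrozenSet[str]       # least privilege: everything else is denied
    irreversible: FrozenSet[str]  # actions that always need a named approver
    tools: Dict[str, Callable[..., str]] = field(default_factory=dict)

    def call(self, action: str, *, approved_by: Optional[str] = None, **kwargs) -> str:
        # Checks run in code, whatever the model was told or talked into.
        if action not in self.allowed or action not in self.tools:
            raise PermissionDenied(f"{action!r} is not permitted for this agent")
        if action in self.irreversible and approved_by is None:
            raise ApprovalRequired(f"{action!r} needs a named human approver")
        return self.tools[action](**kwargs)


# Illustrative wiring with stand-in tool implementations.
gateway = ToolGateway(
    allowed=frozenset({"read_file", "delete_file"}),
    irreversible=frozenset({"delete_file"}),
    tools={
        "read_file": lambda path: f"read {path}",
        "delete_file": lambda path: f"deleted {path}",
    },
)
```

With this wiring, `gateway.call("read_file", path="a.txt")` succeeds, `gateway.call("delete_file", path="a.txt")` raises `ApprovalRequired` until a person is named in `approved_by`, and any action outside the allow-list raises `PermissionDenied`. A prompt injection can change what the model asks for. It cannot change what the gateway permits.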

The EU AI Act's general-purpose AI obligations have applied since August 2025. The heavyweight high-risk rules - originally due August 2026 - are, as I write, being deferred to late 2027 under the Digital Omnibus, precisely because the standards and tooling needed to comply weren't ready in time.

Sit with that for a second. The deadline is slipping because the means to meet it continuously don't exist yet. That is the entire argument of this series, conceded by the regulator. A point-in-time deadline can be postponed; a continuous capability wouldn't need to be.

And the Act has a sting most builders haven't clocked.

Under Article 25(1)(c), if you deploy someone else's model for a high-risk purpose, you become the high-risk provider - you inherit the full obligations, including proving the training data was representative and sufficient. Except you often can't, because that information sits upstream with the model provider and was never passed to you. The liability lands on the builder who is least equipped to discharge it. Adam Leon Smith, writing for market-surveillance authorities, reaches this same conclusion from the legal side - that an agent's compliance has to be designed into its architecture, not asserted after the fact, because exhaustive testing and prompt-level instruction simply don't satisfy the essential requirements.¹

This is where it stops being a security problem and becomes a structural one, spanning engineering, compliance, and law at once. Nothing in the traditional kit was built for it. An ISMS - ISO 27001, the one most of our clients chase for enterprise deals - assumes you define your processes, follow them, and document that you did, with controls that operate every day and an auditor who checks the evidence once a year. That model has no concept of a control for a thing whose behaviour is generated at runtime and changes with the input. You can't write an annual control for a probability distribution.

So you do the only thing that works: you make the check continuous, and you make it part of the build rather than a gate after it. This is the same move the second post in this series made for privacy. Policy stops being a PDF and becomes something that runs. A runbook that's aware of your organisation's actual context - not the cookbook default - declares which connectors read which sources under which rulesets. Schemas carry a security finding and a regulatory classification through the same pipeline, with the same structured treatment, to the same human who owns the call. The routine detection gets automated, because that's the busywork. And rulesets let real expertise - security, privacy, and legal - be written down once and shift left into the loop, so the judgement runs at the speed the AI does instead of waiting for an audit that arrives a year too late.
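To show the shape of "policy that runs", here is a deliberately tiny sketch. It is not the Waivern Compliance Framework's actual schema or API; the runbook layout, ruleset names, regex patterns, and owner roles are all assumptions made up for this example.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Finding:
    ruleset: str
    kind: str      # "security" or "regulatory": one schema for both
    source: str
    evidence: str
    owner: str     # the human who owns the call


# Hypothetical rulesets: expertise written down once, as data.
RULESETS: Dict[str, dict] = {
    "hardcoded-secrets": {
        "kind": "security",
        "pattern": r"(?i)api[_-]?key\s*=\s*\S+",
        "owner": "security-lead",
    },
    "personal-data": {
        "kind": "regulatory",
        "pattern": r"[\w.+-]+@[\w-]+\.[\w.]+",
        "owner": "privacy-counsel",
    },
}

# Hypothetical runbook: which connector reads which source under which rulesets.
RUNBOOK: List[dict] = [
    {"connector": "repo", "source": "config/settings.py", "rulesets": ["hardcoded-secrets"]},
    {"connector": "repo", "source": "fixtures/users.csv", "rulesets": ["personal-data"]},
]


def run(runbook: List[dict], read: Callable[[str], str]) -> List[Finding]:
    """Apply each step's rulesets to its source and collect structured findings."""
    findings = []
    for step in runbook:
        text = read(step["source"])
        for name in step["rulesets"]:
            rule = RULESETS[name]
            for match in re.finditer(rule["pattern"], text):
                findings.append(
                    Finding(name, rule["kind"], step["source"], match.group(0), rule["owner"])
                )
    return findings


# In-memory stand-in for a real connector.
SOURCES = {
    "config/settings.py": 'API_KEY = "abc123"\nDEBUG = False\n',
    "fixtures/users.csv": "name,email\nAlice,alice@example.com\n",
}
findings = run(RUNBOOK, SOURCES.__getitem__)
```

The security finding and the regulatory one come out in the same structure, each with a named owner. Because the runbook is data, it can run on every commit rather than once a year.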

The hardest part is the bit a builder genuinely can't see for themselves - is my use high-risk? have I become the provider under Article 25? whose training data am I now accountable for? That's not a gap you paper over with a prompt. That's where human, legal-led expertise belongs: on demand, inside the same system, not bolted on at audit. You cannot out-review a machine by hand. You can only out-review it with a system that runs continuously and escalates the genuine judgement calls to the people equipped to make them.

That's not a feature we added for AI. It's what the architecture was always for. AI in the product just made it urgent.

What it comes down to

Three posts, one argument, narrowing each time. Judgement belongs to humans. The architecture has to keep it there visibly. And now: when AI writes the code - and when AI is the product - the only way to keep judgement in the loop is to make the loop itself continuous, for code quality, for security, for privacy, and for the AI you're shipping.

A check as a checkpoint made sense when the door rarely opened. The door doesn't open and close anymore. It's a turnstile that never stops, and a gate you walk through once a year is just decoration in front of it.

So here's the question I'd put to any team building fast with AI, the same way I closed the first two posts: where in your loop does a security, privacy, or compliance decision actually get made - and is it still being made at all, once the AI is writing most of the code and increasingly is the product? If the honest answer is "at the audit" or "in the pen test we run before release," the gap between those moments is where your real exposure lives. And it's getting wider with every commit.

We built the Waivern Compliance Framework in the open precisely so that this loop is something you can read, run, and verify rather than take on faith. If the argument resonates, the source is right there: github.com/waivern-compliance.


¹ Adam Leon Smith, "9 things to look for in compliant agentic AI" (June 2026), building on the paper "AI Agents Under EU Law: A Compliance Architecture for AI Providers" (Nannini, Leon Smith et al., arXiv:2604.04604, April 2026). The kⁿ execution-path point, the "a system prompt is not a security control" principle, and the Article 25(1)(c) provider-attribution trap are drawn from his account of how the EU AI Act's essential requirements apply to agentic systems.

This is the third of three pieces on building AI compliance honestly. Part 1 argues that regulatory judgement belongs to humans; Part 2 shows the architecture that keeps it there.

I'm CTO and co-founder of Waivern. The framework discussed here is open source. Figures on AI adoption and developer sentiment are drawn from the Stack Overflow 2025 Developer Survey, the DX Q4 2025 AI Impact Report, and related industry research current as of early 2026; the Replit incident is as reported in July 2025.

As of early June 2026, the Digital Omnibus deferral is a provisional agreement reached 7 May 2026, awaiting formal adoption and publication; until then the original 2 August 2026 high-risk date technically still stands. The point holds either way - the deadline is being moved because the means to meet it weren't ready.