An Architecture for Trustworthy AI Compliance (You Can Read the Source)

Compliance demands two things a large language model structurally cannot provide.

It demands determinism. The same facts have to produce the same regulatory conclusion every time, and they have to produce it for a reason you can point to. An LLM is a probability machine - run the same prompt twice and you can get two different answers, with no stable account of why.

It demands accountability. Someone has to be answerable when the call is wrong - a named human who can be questioned, corrected, sued, or struck off. An LLM can be none of those things. It has no skin in the game because it has no skin.

You cannot prompt your way out of either problem. A cleverer prompt does not make a probabilistic system deterministic, and it does not give a model a professional licence to lose.

This is the core reason you can't simply point an LLM at compliance and ship the output - the argument I made at length in the first part of this series. The more interesting question is the one it leaves open: if the model can't make the decision, what does building this properly actually look like?

Here's how we did it. And why every piece of it is open source on GitHub.

Compliance-as-code, and what it's actually for

The conceptual move that unlocks everything is treating compliance as an engineering problem rather than a consulting one.

Infrastructure-as-code did this for infrastructure. Before Terraform and its successors, infrastructure was clickops in a cloud console: configuration drift, snowflake servers, and tribal knowledge living in the heads of three sysadmins who you hoped wouldn't leave in the same week. After Terraform, infrastructure became declarative configuration - version-controlled, peer-reviewable, reproducible, testable in CI/CD.

Compliance today is roughly where infrastructure was in 2012. Questionnaires in spreadsheets. Evidence collected by chasing engineers on Slack. Policies nobody reads, written by consultants who have since moved on. Findings that live in a slide deck and nowhere else. The whole thing is opaque to the engineering and product teams who actually built the systems being assessed.

We think it's time for compliance to have its Terraform moment.

But here's the part that's easy to get wrong. Compliance-as-code is not an answer to the question of who makes the regulatory decision. It does not automate judgement. What it does is give judgement a place to live where you can see it.

A regulatory decision - is this a special category of personal data under GDPR Article 9? does this processing need a DPIA? - still belongs to a human with legal expertise. Compliance-as-code is the container that isolates that decision, writes it down as an inspectable artefact, version-controls it, and lets anyone audit it. The decision stays human. The framework makes that decision reproducible and reviewable instead of trapped in a consultant's head or buried in a PDF.

That distinction underpins the whole design.

What this looks like in practice

Figure: The Waivern Compliance Framework explainer. Evidence gathering (connectors pulling from organisation policy docs, application and infrastructure code, and cloud services) feeds composable analysis (compliance analysers passing standard-schema messages, with human-authored rulesets and legal/compliance review feeding in), which feeds outputs (workflow management, monitoring, document drafting, custom integration). An orchestration layer of open-standard YAML runbooks underpins the whole pipeline.

The Waivern Compliance Framework has five core concepts.

Runbooks (YAML). You declare the analysis you want to run: which connectors fetch from which sources, which analysers process the output, which rulesets to apply, what the final deliverables look like. A runbook is a typed, versioned, peer-reviewable artefact - the compliance equivalent of a CI/CD workflow file. It's also where a legal expert becomes an author rather than a reviewer. A privacy specialist can map a compliance process into a runbook the same way a DevOps engineer maps infrastructure into Terraform, and the result gets reviewed, diffed, and re-run like any other code.
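To make the shape concrete, here is a minimal sketch of what a runbook declares and one check a reviewer or CI job might run on it. It is written as a Python dict purely for illustration: the real runbooks are YAML, and every field name below (`connectors`, `analysers`, `steps`, `ruleset`) and the `undeclared_references` helper are assumptions, not the framework's actual schema.

```python
# Hypothetical runbook, expressed as a Python dict for illustration only.
# The real framework uses YAML; all field names here are assumptions.
RUNBOOK = {
    "name": "gdpr-personal-data-review",
    "connectors": {
        "app_db": {"type": "mysql"},
        "app_source": {"type": "source_code"},
    },
    "analysers": {
        "personal_data": {"ruleset": "personal_data_v1"},
    },
    "steps": [
        {"id": "scan_db", "connector": "app_db", "analyser": "personal_data"},
        {"id": "scan_code", "connector": "app_source", "analyser": "personal_data"},
    ],
}


def undeclared_references(runbook: dict) -> list[str]:
    """Return one message per step that names a component the runbook never declares."""
    problems = []
    for step in runbook["steps"]:
        if step["connector"] not in runbook["connectors"]:
            problems.append(f"{step['id']}: unknown connector {step['connector']!r}")
        if step["analyser"] not in runbook["analysers"]:
            problems.append(f"{step['id']}: unknown analyser {step['analyser']!r}")
    return problems
```

Because the plan is plain data, a change to it shows up in a diff like any other change, and a check like this can fail the build before anything runs.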

Connectors. Components that pull data from real technical sources: MySQL, MongoDB, filesystems, source code. Not "describe your data processing in this questionnaire" but actual queries against actual databases and actual analysis of actual source code. This solves a specific, expensive problem: self-reported questionnaires are unverifiable ("to the best of my knowledge...") and they go stale, and get forgotten, the moment they're submitted. A connector produces evidence that is reproducible and grounded in the system's real state. Run the runbook again next week and you get the same finding from the same code - or a different one because the code changed, which is exactly what you want to know.
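As a rough illustration of "evidence from the system's real state", the sketch below reads column metadata straight from a live database schema. It uses Python's built-in `sqlite3` with an in-memory database as a stand-in for MySQL, and the `extract_columns` function and its record fields are hypothetical names, not the framework's connector API.

```python
import sqlite3


def extract_columns(conn: sqlite3.Connection) -> list[dict]:
    """Return one evidence record per column, read from the live schema."""
    evidence = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        for _cid, column, col_type, *_rest in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            evidence.append(
                {"source": "sqlite", "table": table, "column": column, "type": col_type}
            )
    return evidence


# Stand-in for a production database (assumption: a single users table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, created_at TEXT)")
evidence = extract_columns(conn)
```

Running `extract_columns` twice against an unchanged schema returns the same records, which is the property a questionnaire cannot offer.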

Connectors are also where open source quietly becomes a structural advantage rather than a slogan. Some vendors boast about thousands of integrations. Thousands of integrations means thousands of pieces of code somebody has to maintain, plus the UI to configure them. With open, MCP-style connectors, that work doesn't all fall on us, and a client can build their own connector against a proprietary internal system we've never heard of without waiting for our roadmap. (Anthropic has been pushing hard to make this kind of connector cheap for everyone, which helps the whole category.)

Analysers. Components that detect compliance issues: a personal-data analyser, a processing-purpose analyser, a data-subject analyser, and so on. They combine deterministic pattern matching with optional LLM validation under tight constraints. The problem they solve is coverage - finding the needles (an email column here, a tracking SDK there) across a codebase far too large for a human to read line by line, then handing the candidates up the chain for classification.
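The deterministic half of an analyser can be sketched in a few lines. The patterns and the `find_candidates` function below are illustrative assumptions, not the framework's real detection rules; the point is only that the same input always yields the same candidates.

```python
import re

# Hypothetical patterns for illustration; real ones would live in a reviewed ruleset.
PATTERNS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "date_of_birth": re.compile(r"date_of_birth|dob", re.IGNORECASE),
}


def find_candidates(column_names: list[str]) -> list[dict]:
    """Flag column names that match a known pattern. No model involved, no judgement made."""
    candidates = []
    for name in column_names:
        for label, pattern in PATTERNS.items():
            if pattern.search(name):
                candidates.append({"column": name, "matched": label})
    return candidates
```

For example, `find_candidates(["id", "user_email", "mobile_number", "created_at"])` flags `user_email` and `mobile_number` and nothing else. The output is a list of candidates, not conclusions; classification happens further up the chain.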

Rulesets (YAML). This is where the regulatory decision actually lives, and it's the answer to everything above. Legal and regulatory expertise is encoded as YAML - patterns, classifications, decision logic - written and reviewed by people with genuine legal expertise. When the framework finds personal data, the judgement about what kind it is and which regulatory category it falls under comes from a ruleset, not from asking a model. An LLM might validate or contextualise a finding, but the legal authority sits in version-controlled YAML you can read.

The legal authority lives in YAML rulesets a human wrote. The LLM doesn't decide what counts as a GDPR Article 9 issue - it just helps with the routine work around the decision.
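A minimal sketch of the idea, assuming a ruleset reduces to a versioned lookup table. The real rulesets are YAML and carry richer decision logic; the entries, field names, and `classify` function here are placeholders for illustration, not a reviewed ruleset.

```python
# Hypothetical ruleset as a Python dict. Entries are illustrative placeholders.
RULESET = {
    "version": "0.1-example",
    "rules": {
        "email": {"category": "personal_data", "basis": "GDPR Art. 4(1)"},
        "health_condition": {"category": "special_category", "basis": "GDPR Art. 9(1)"},
    },
}


def classify(label: str, ruleset: dict = RULESET) -> dict:
    """Classify a candidate by lookup. Anything the ruleset does not cover goes to a human."""
    rule = ruleset["rules"].get(label)
    if rule is None:
        return {
            "label": label,
            "category": "needs_human_review",
            "basis": None,
            "ruleset_version": ruleset["version"],
        }
    return {"label": label, **rule, "ruleset_version": ruleset["version"]}
```

The classification is a pure function of the candidate and the ruleset version, so changing a legal interpretation means changing a reviewed file, and the change is visible in version history.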

Schema-driven message passing. Every component declares its input and output schemas in JSON Schema, and the executor validates that connectors and analysers are actually compatible. Messages between components are typed and checked at runtime. When an LLM is asked to produce a finding, its output has to conform to a declared schema with reasoning a human can verify. There is no untyped channel through which hallucinated free text can leak into the final report.
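The gate can be illustrated with a hand-rolled check. This is a simplified stand-in, not JSON Schema validation itself: a real implementation would use a proper JSON Schema validator, and the field list and `validate_finding` function are assumptions for the sketch.

```python
# Hypothetical finding shape: field name -> required Python type.
FINDING_FIELDS = {
    "category": str,
    "source": str,
    "location": str,
    "regulatory_basis": str,
    "reasoning": str,
}


def validate_finding(message: object) -> list[str]:
    """Return a list of validation errors; an empty list means the message may pass."""
    if not isinstance(message, dict):
        return ["message must be an object, not free text"]
    errors = []
    for field, expected in FINDING_FIELDS.items():
        if field not in message:
            errors.append(f"missing field: {field}")
        elif not isinstance(message[field], expected):
            errors.append(f"wrong type for field: {field}")
    # Reject anything outside the declared shape instead of silently passing it on.
    for field in sorted(set(message) - set(FINDING_FIELDS)):
        errors.append(f"unexpected field: {field}")
    return errors
```

A paragraph of model-generated prose fails at the first check, and a structured message with a missing regulatory basis fails at the second.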

Underneath, the whole thing runs as a directed acyclic graph: the planner parses the runbook, validates schemas, flattens dependencies, and the executor runs components in parallel where it can. If you've used Airflow, Dagster, or any modern CI/CD system, you already know the shape of it.
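The planning step can be sketched with Python's standard-library `graphlib`. The step names and the `plan_batches` function are hypothetical; this shows the general technique of topological batching, not the framework's actual planner.

```python
import graphlib

# Hypothetical steps: each key maps to the set of steps it depends on.
DEPENDENCIES = {
    "fetch_db": set(),
    "fetch_source": set(),
    "analyse_personal_data": {"fetch_db", "fetch_source"},
    "analyse_purposes": {"fetch_source"},
    "report": {"analyse_personal_data", "analyse_purposes"},
}


def plan_batches(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group steps into batches; every step in a batch can run in parallel."""
    sorter = graphlib.TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable, reviewable plan
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

With the dependencies above, the two fetch steps form the first batch, the two analysers the second, and the report runs last.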

Why this is trustworthy in a way pure-AI tools aren't

For a buyer worried about AI making things up, here's what the architecture buys you.

A human designs the workflow, not an agent. The interesting failure mode of agentic compliance isn't a single wrong answer - it's an autonomous agent deciding, on its own, which checks to run and which to skip. In our framework a human decides that: the runbook spells out which analysers run against which sources under which rulesets, and that plan is a reviewed, version-controlled artefact. The agent doesn't get to improvise the audit. A person designs it, and you can read the design.

The pipes carry structured data built for compliance, not free text. Every component passes typed messages validated against a declared schema. That schema is the point: it's high-quality, purpose-built compliance data - a finding has a category, a source, a location, a regulatory basis - not a paragraph of prose a model happened to generate. Where a model is involved, its output has to land in those fields or it doesn't pass. There is no untyped channel through which a confident-sounding sentence can become a finding without being structured, classified, and checkable first.

Every finding traces back to something you can re-examine. Because findings come from databases, source code, and filesystems rather than questionnaires, each one points at the actual state of the system, and re-running the runbook against the same state reproduces it. Compare that to the alternative: good luck auditing what a model was thinking the day it decided a 16-year-old no longer counts as a minor. A finding you can re-derive is evidence. A finding you have to take on trust is just an opinion with good formatting.
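One way to make "re-derivable" checkable is to fingerprint a run's findings, so two runs against the same state can be compared byte for byte. This is a sketch of that idea under our own assumptions; the `fingerprint` function is hypothetical and not part of the framework.

```python
import hashlib
import json


def fingerprint(findings: list[dict]) -> str:
    """Hash findings in a canonical form so the order they were discovered in does not matter."""
    canonical_items = sorted(json.dumps(f, sort_keys=True) for f in findings)
    canonical = json.dumps(canonical_items, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

If the fingerprint of today's run matches last week's, nothing relevant changed. If it differs, the difference is something a person can go and look at.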

It bends to fit the client, not the other way round. Because review is a pluggable component rather than a hard-wired step, the same framework serves very different kinds of customer. A regulated scale-up can route findings to its in-house legal team. A small startup with no legal function can route them to our managed service. A firm with an existing relationship with outside counsel can plug that firm in. Big or small, self-hosted or managed, one regime or five - the architecture adapts to who is doing the judging instead of forcing everyone through the same workflow.
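In code terms, "pluggable review" can be as small as a registry of interchangeable reviewers behind one interface. The reviewer names and the `route` function below are assumptions made for illustration, not the framework's API.

```python
from typing import Callable

# A reviewer takes a finding and returns it annotated with a review assignment.
Reviewer = Callable[[dict], dict]


def in_house_legal(finding: dict) -> dict:
    return {**finding, "reviewer": "in_house_legal", "status": "queued"}


def managed_service(finding: dict) -> dict:
    return {**finding, "reviewer": "managed_service", "status": "queued"}


def outside_counsel(finding: dict) -> dict:
    return {**finding, "reviewer": "outside_counsel", "status": "queued"}


# Hypothetical registry; a client could register its own reviewer here.
REVIEWERS: dict[str, Reviewer] = {
    "in_house_legal": in_house_legal,
    "managed_service": managed_service,
    "outside_counsel": outside_counsel,
}


def route(finding: dict, reviewer_name: str) -> dict:
    """Send a finding to whichever reviewer the client has configured."""
    if reviewer_name not in REVIEWERS:
        raise ValueError(f"no reviewer registered as {reviewer_name!r}")
    return REVIEWERS[reviewer_name](finding)
```

Swapping who does the judging is then a configuration choice, and the rest of the pipeline does not need to know.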

We're not betting against better models

One thing we are deliberately not claiming is that models will never be good enough to take on more of this.

They will keep improving, and as they do we will move more steps of the runbook onto them - transparently, with the schema and ruleset gates still in place. Our position has never been "AI can't be trusted here." It's "wherever a model earns its place, you should be able to see exactly where it sits and what is checking it." When a model gets reliable enough to own a step that a human owns today, that's a one-line change to a runbook anyone can inspect, not a quiet upgrade to a black box.

This is also why we don't put any faith in the wrapper approach. A bare prompt is unpredictable by nature, which makes it about as good as having nothing at all. The value was never in the prompt - it's in the structure around it, and the structure is what survives as the models change underneath.

Why open source

The most common reaction to "we open-sourced our compliance framework" is some version of why on earth would you do that?

The honest answer is that for this specific problem, open source isn't generosity - it's the only credible posture.

Think about what you're really asking a buyer to trust. You're asking them to believe that your software, pointed at their systems, will produce findings they can defend to a regulator. That's a far bigger ask than "trust us with your customer list." If they can't see what the software checks and how it classifies what it finds, they're trusting you on faith.

Closed-source compliance tooling asks you to trust on faith that the software will produce defensible findings. Faith doesn't survive contact with a regulator.

When the framework is open:

A buyer's engineers can read the connectors and verify exactly what is being extracted from their systems.
A buyer's legal team can read the rulesets and confirm they encode the regulations they care about.
Security-conscious buyers can audit the entire data flow before granting access to production.
The community can contribute rulesets for regulations or interpretations we don't yet cover, and connectors for sources we don't yet support.

This is structurally impossible for closed-source SaaS competitors. OneTrust can't show you their compliance logic. Vanta can't let you read the rules they apply. Not because they're hiding something sinister, but because their business model depends on that IP staying closed. Ours doesn't.


So where's the business? In the layer above the open core: the managed service, expert human review for the findings that need judgement, hosted infrastructure for buyers who don't want to self-host, ruleset updates as regulations move, support, integration work. The same open-core pattern that works for HashiCorp, Elastic, GitLab, and MongoDB.

Continuous compliance: the part everyone forgets

The market talks endlessly about getting certified and almost never about staying compliant. The audit is the forcing function and the certificate is the deliverable, so that's where all the attention goes.

But the actual work of compliance is continuous. Your codebase changes. Your vendors change. Your data flows change. The regulations themselves change - DORA is in force, the EU AI Act is phasing in, US state privacy laws keep multiplying. A SOC 2 Type II report covers a defined window, but you're either operating compliantly on all the other days or you aren't.

Compliance-as-code is uniquely suited to this, because the pipeline runs on every commit. Add a third-party SDK? The runbook flags it on the next run. Start processing a new category of personal data? The analyser catches it. Stand up a new database? The connector picks it up.
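The per-commit check boils down to comparing this run's findings with the last run's. A minimal sketch, assuming findings are identified by category and location (our assumption; the `new_findings` function is hypothetical):

```python
def new_findings(previous: list[dict], current: list[dict]) -> list[dict]:
    """Return findings present in the current run that the previous run did not contain."""
    seen = {(f["category"], f["location"]) for f in previous}
    return [f for f in current if (f["category"], f["location"]) not in seen]
```

If a commit adds a `users.phone` column, the next run returns exactly one new finding for it, and that finding can be routed to a human before it reaches production.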

The underlying principle is simple: decompose the compliance workflow into steps, and automate the ones that are either low-value busywork or genuinely improved by automation - while routing the judgement calls to humans. Continuous compliance isn't a separate product. It's what you get for free once compliance lives in the engineering workflow instead of in a quarterly project.

What this all comes down to

We didn't choose this architecture because we read a book about open-core business models. We chose it because the alternatives don't work for the problem.

Closed-source SaaS doesn't work, because buyers can't audit what's being done with their data or their regulatory posture. Pure-AI doesn't work, because the hallucination problem is real and the legal consequences can't be taken back. Pure consulting doesn't work, because it doesn't scale and it prices early-stage companies out entirely. Traditional GRC platforms don't work, because they're built and priced for enterprises with dedicated compliance teams.

Open-core compliance-as-code is what you get when you start from "how do we make this trustworthy at scale and affordable for companies that aren't enterprises" and let the architecture follow.

There's a question worth asking any compliance vendor, including us: where in your pipeline does a regulatory decision actually get made, and by whom? For a lot of tools the honest answer is "our AI," and they'd rather you didn't ask. Our answer is specific and checkable: the decision lives in a YAML ruleset a human wrote, the model only does the routine work around it, and you don't have to take our word for any of it. The source is right there.

So if this resonates - if you're a CTO, a head of engineering, a founder, or a privacy lead trying to make compliance work without paying enterprise prices or trusting AI you can't inspect - go read the code. Open an issue. Join the Discord. We're building this in the open because we think it's the only honest way to build it.

This is the second of three pieces on building AI compliance honestly. Part 1 makes the case that regulatory judgement belongs to humans; Part 3 extends the architecture here into security and AI regulation, under the pressure of AI-accelerated development.

I'm CTO and co-founder of Waivern. The framework discussed here is open source at github.com/waivern-compliance.