Here's a question that's been nagging at me: if AI can now generate an entire production backend from a spec, who exactly is supposed to check whether it did the right thing?
Not whether it compiles. Not whether the tests pass. Whether the business logic it encoded—the pricing rules, the permission model, the approval workflow—actually reflects what your organization intended.
This question is at the heart of a concept that's starting to gain traction in engineering circles: harness engineering. The term isn't widely established yet, but the idea is resonating fast, and I think it's worth paying attention to—both for what it gets right and what it's missing.
The harness idea
The term traces back to an experiment at OpenAI, where a team spent five months building a production application with over a million lines of code—and zero lines written by a human. The engineers didn't write code. They designed the system that let AI write code reliably: custom linters, structural tests, a curated knowledge base, agents that periodically sweep for architectural drift. They called this system a harness.
Birgitta Böckeler, writing on Martin Fowler's blog, gave the concept its sharpest framing. She observed that for maintainable AI-generated code at scale, you have to constrain the solution space—trading flexibility for reliability. She also drew a useful distinction, borrowed from Kief Morris, between being in the loop (inspecting every line) and being on the loop (building the system that produces the right output).
Since then, Ossature, an open-source harness tool, has launched. LangChain published a breakdown of what harnesses are and why they matter. Engineering teams are beginning to talk publicly about building their own structured scaffolding for AI agents.
The central insight is compelling: AI agents work best in environments with strict boundaries and predictable structure. As the OpenAI team described, when an agent struggles, the right response is to ask what constraint or capability is missing—not to simply retry. The best harnesses give AI less—less context per task, less architectural freedom, less room to drift—and counterintuitively get more coherent results.
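That "what constraint is missing?" posture can be made concrete. Here is a minimal sketch, with entirely hypothetical names, of one kind of harness check: a structural lint that rejects generated code importing outside an approved boundary, so a struggling agent gets a specific missing-constraint signal instead of a blind retry.

```python
import ast

# Hypothetical sketch of a harness-style structural lint. The allowlist
# and module names are illustrative, not taken from any named tool.
# A generated module for a "billing" task may only import from approved
# modules; anything else is reported as actionable feedback.
ALLOWED_IMPORTS = {"billing", "shared.models", "shared.errors"}

def check_imports(source: str) -> list[str]:
    """Return a list of boundary violations found in generated source."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            ok = any(name == allowed or name.startswith(allowed + ".")
                     for allowed in ALLOWED_IMPORTS)
            if not ok:
                violations.append(f"disallowed import: {name}")
    return violations

generated = "import billing\nimport payments.legacy\n"
print(check_imports(generated))  # → ['disallowed import: payments.legacy']
```

The point isn't the specific rule; it's that each rule turns an agent failure into a named, fixable gap in the harness.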
I think this is genuinely important. I also think it's only solving half the problem.
Who is the harness for?
As currently practiced, the answer is clear: developers. The specs are written by engineers. The verification is compilation and test suites. The build plans are TOML files. The review happens in terminals and code editors. The entire workflow assumes the person governing AI output could have written the code themselves.
But when AI generates backend logic for a real organization, that logic encodes business rules, pricing calculations, permission models, compliance constraints, and approval workflows. The people who own those decisions—product managers, compliance officers, finance teams—are rarely the people reviewing TOML build plans. (I'm willing to bet they don't even know what TOML is, and they shouldn't have to.)
The coherence problem—making sure AI-generated code is internally consistent—is being addressed. The comprehension problem—making sure the right people can understand and validate what was built—is not. And the comprehension problem is where the real risk lives.
Böckeler raised a version of this when she noted that the OpenAI write-up focused on internal quality but didn't say much about whether the software actually does what it's supposed to. A passing test suite tells you the code does what the spec says. It doesn't tell you whether the spec accurately reflects what the business intended.
I've written before about the importance of validating what AI builds. But a harness that only engineers can read is a governance mechanism with a single point of failure.
From bespoke harness to platform
There's a second gap. The OpenAI team spent five months building their harness—custom linters, structural tests, knowledge bases, sweep agents—before generating a single line of production code. That made sense for their experiment. But is every team supposed to build their own from scratch?
Böckeler asked exactly this when she wondered whether harness techniques can work for existing applications or only for greenfield projects. The answer from most harness-oriented tools right now is: it's complicated.
But the principles underneath harness engineering—opinionated structure, constrained AI output, verification before production, human review—shouldn't require a five-month custom build. When we moved from managing our own servers to cloud platforms, we didn't ask every team to build their own auto-scaling and deployment pipelines. We embedded those capabilities into the infrastructure. The same logic applies here.
What the complete harness looks like
So if the current version of harness engineering solves the coherence layer—making sure AI-generated code is consistent and technically verified—what would a complete harness look like? I think it has three layers.
Structured patterns, not freeform generation. Current harnesses constrain AI through specs and narrow context windows, which is a good start. But the constraint should go deeper than the prompt. If AI-generated logic is channeled through opinionated, standardized patterns at the platform level—where every API, workflow, and data model follows the same structure by default—then consistency isn't something you verify after the fact. It's enforced before the code is even generated. This is the difference between checking whether the AI followed your architectural rules and making it structurally difficult to break them.
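As a rough illustration of enforcement-before-generation, here is a hedged sketch in which the platform defines one endpoint shape and generated logic can only fill in the handler slot. All names (`EndpointSpec`, `register`, the route rules) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch: the platform owns the pattern; AI-generated code
# only supplies the handler body. Structural rules are checked at
# registration time, before any generated logic runs.
@dataclass(frozen=True)
class EndpointSpec:
    method: str
    path: str
    handler: Callable[[dict], dict]

REGISTRY: dict[str, EndpointSpec] = {}

def register(spec: EndpointSpec) -> None:
    if spec.method not in {"GET", "POST", "PUT", "DELETE"}:
        raise ValueError(f"unknown method: {spec.method}")
    if not spec.path.startswith("/api/"):
        raise ValueError("all routes must live under /api/")
    REGISTRY[f"{spec.method} {spec.path}"] = spec

# The generated part is just the business logic inside the slot:
register(EndpointSpec("POST", "/api/quotes",
                      lambda req: {"price": req["qty"] * 10}))
```

Breaking the architecture here isn't something a reviewer has to catch; the pattern makes it a hard error at registration.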
Visual validation for human comprehension. If the harness constrains and verifies AI output, then a visual layer is what makes that output legible to the people who need to approve it. Not just engineers reviewing generated code, but product owners verifying business rules, compliance teams checking policy enforcement, and team leads understanding how logic flows before it reaches production. The democratized pull request isn't just about letting more people propose changes; it's about letting more people understand changes. That requires the review layer to meet people where they are: not in a terminal, but in a representation that makes system behavior visible at a glance.
Agent-safe infrastructure. Even a well-constrained agent working through standardized patterns can produce logic that's syntactically valid but behaviorally wrong. A complete harness needs isolated, ephemeral environments where AI-generated code can be tested against real conditions before it touches production. The workflow should be generate → test → deploy, not generate → deploy → discover. Guardrails on the agent aren't enough. You need guardrails on the infrastructure.
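The generate → test → deploy ordering can be sketched as a simple gate, where deployment is only reachable through a passing test run. Everything here is a stand-in: a real harness would spin up an isolated environment and exercise the generated logic against fixtures, which the placeholder check below merely gestures at.

```python
# Hypothetical sketch of a generate → test → deploy gate.
def run_behavioral_tests(code: str) -> bool:
    # Stand-in for a real ephemeral-environment test run: provision an
    # isolated environment, load fixtures, exercise the generated logic.
    return "DROP TABLE" not in code  # trivially conservative placeholder

def deploy(code: str) -> str:
    return "deployed"  # stand-in for the real deployment step

def ship(generated_code: str) -> str:
    if not run_behavioral_tests(generated_code):
        # Failure feeds back to the agent; the code never reaches
        # production, so "discover" happens before "deploy".
        return "rejected: behavioral tests failed"
    return deploy(generated_code)

print(ship("SELECT * FROM orders"))  # → deployed
print(ship("DROP TABLE orders"))     # → rejected: behavioral tests failed
```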
Coherence, comprehension, and infrastructure. That's the complete harness.
Like this take on the future of software development in the AI era? Get the latest posts straight in your inbox by subscribing to the Futureproof newsletter on LinkedIn.