Correctness by Construction in the AI Age

Correctness by construction in the AI age means moving trust out of the assistant and into compilers, hooks, macros, CLIs, tests, and gates that make bad output unshippable.

It matters more now than it ever did before. At first glance the two ideas seem like antonyms. A fair question: how can anything be correct by construction when the same prompt fed to the same model can produce a different reply tomorrow? When model quality drifts as compute scales up or down?

AI agents are not the magic that they can often seem to be. Their learning and their neural networks are based off of humans, so it stands to reason they're going to have a lot of the same issues that humans run into. Again, prompting an AI to write the right code is just as good as prompting a human to write the right code. In my consulting experience, and quite honestly in my own code-writing experience, you can have 50 stand-up meetings dedicated to code quality. You can deny 50 PRs, and you can read recommended software architecture books until the cows come home. Humans are going to forget, get lazy, and/or not see the value.

At the time of writing this article, the idea of agents writing all code is a fallacy, but the idea that agents can only write slop is a far more egregious one.

You Must First Invent the Universe

We are going to do our own version of inventing the universe. That starts with the reason code exists at all, before we touch language choice, agents, hooks, or skills.

Code gives us speed, obviously. Determinism is the bigger prize.

A pacemaker cannot usually decide correctly. Flight-control software cannot improvise on the millionth loop because it got tired or inspired. The whole point of putting the rule in code is that the same input should hit the same rule, the same way, every time, at machine speed. If there is any chance of variability where determinism matters, it is instantly a non-starter.

If we want AI agents to scale, we have to drag them toward that same standard: fewer vibes and a repeatable path from instruction to acceptable output.

First question: what language are we writing in?

Do the mental experiment even if you already know what language you want to write in. Do it even if you have a project in your head right now and you are currently unwilling to switch languages. Do it even if you only know one language at this exact moment.

The language is paramount. Yes, we are going to get into guardrails inside your AI harness, but the compiler should be doing as much work as you can possibly make it do. The important rules need to live somewhere with teeth.

Wherever possible, use a type-safe language with strong opinions. If you are using libraries inside that language, use opinionated libraries too. Avoid libraries altogether when that is the correct shape for the problem. If you must use them, pick libraries that are strict about correctness and make nonsense hard to express.

Choosing your language gives you better guardrail material later. Some languages I would suggest:

TypeScript, especially if you are a beginner or not as confident in other languages. Everyone has to learn TypeScript and all its baggage, for better or worse, so I am obligated to say it first, and it brings me no joy to do so.
Rust, because the compiler acts exactly how you would expect from a language this obsessed with correctness. The ecosystem also has serious libraries, lints, and macro surfaces that let you push those rules into your domain model.
Haskell, because a suspicious number of the best ideas in other languages trace back to it. The only "downside" is that it can be too opinionated sometimes, but if you are reading this article, that is a check mark in the right direction for us.

There are plenty more, but the decision matrix is this: what does compiling tell me? In Haskell, Rust, Agda, Coq, and languages in that neighborhood, the type system can encode invariants, preconditions, invalid state transitions, and sometimes whole specifications. Compilation starts acting like a proof check. If it passes, you have ruled out null dereferences, data races, missing cases, impossible states, and, depending on how far you pushed the types, even some logic errors. In a language like TypeScript, compiling can tell you almost nothing if you let it. The closer you get to the proof-checker end of that spectrum, the better.

Throughout this article, I will give examples in Rust and TypeScript because they are two of my most familiar languages, and I think most developers can understand at least one of them.

Pit Before the Prompt

The guiding principle here is to make the AI fall into the pit of success. I did not coin that term, but it might be the best term I have ever heard, and it is one of the operating principles at Jakuta Inc. as a whole.

The original 2003 idea is usually credited through Brad Abrams' writing, with Rico Mariani named as the person who coined the phrase. The cleaned-up version I would point people to is The Pit of Success on The Philosopher Developer, because the old MSDN version is not exactly a joy to read in the modern internet ruins.

That idea was aimed at developers: design APIs so the right thing is the natural thing, and the wrong thing is hard to reach. Some people would call that "opinionated," but opinionated can mean opinionated in the wrong direction. The point is to make the bad move hard to make in the first place.

That same idea matters even more when you are working with AI. At least with a human, hopefully, you can assume they are not hallucinating while they write code. The AI can hallucinate much, much more confidently. Choosing the right language pushes it closer to the pit of success before the agent touches the codebase at all.

The next thing that helps, of course, is skills. If you have never used them, start here, but the short version is this: Claude reads skill files the same way users read Terms of Service; sometimes, and often not at all. A skill is a nudge in the right direction.

Agents often follow skills pretty well these days, especially when a skill is invoked directly. The issue is that skills can drift. Skills are sort of like code in my eyes: they need upkeep. What they give us is a way to fill in the gaps where we cannot be fully deterministic, but they are not enforcement.

So "just use this skill and your agent will never hallucinate again" is out. What is in?

Hooks >

Hooks are the only thing in the agent stack that actually scales. Multi-agent workflows need hooks. Agent orchestration needs hooks. Smaller local models need hooks. The rest of the stack gets useful because hooks force every write through the same rules; without them, the workflow either stops scaling or becomes unusable the second you stop watching it.

A smaller model may not reason as well as the expensive one. For real work, local agents are usually not viable by default. In certain domains, with the right set of hooks, they can become usable because the hooks constrain the output harder than the model ever will on its own.

That opens up a lot of workflows. You can have a high-level agent do the reasoning, design the change, and hand the implementation to a worker. Without hooks, that worker might write something like this:

type SaveUserResult = {
  // Boolean state makes the caller reverse-engineer which fields are valid.
  ok: boolean;
  // These optionals allow nonsense like `{ ok: true, error: "bad" }`.
  user?: User;
  error?: string;
};
 
// `any` means the compiler has clocked out. Have fun debugging `user.id`.
async function saveUser(user: any): Promise<SaveUserResult> {
  try {
    return { ok: true, user: await db.users.save(user) };
  } catch (error) {
    // Every failure becomes a string: database, validation, permission, all soup.
    return { ok: false, error: String(error) };
  }
}

The sloppy version is much faster to type. It is probably fewer tokens for the agent to spit out. Depending on the size of your project, it might even get the product shipped sooner.

Then it's 3 a.m. This works fine as long as nothing ever changes. Something does change, in exactly the wrong way, and now you have a stringly typed error message while you have paying users and a contract to fulfill. Enter: the pit of despair. In my personal experience, I can safely say the "speed gains" have never been worth the sanity losses from writing "pragmatic" code.

With the right hooks, the agent gets forced into this kind of shape:

type SaveUserFailure =
  | {
      kind: "ValidationFailure";
      field: "email" | "displayName";
      message: string;
    }
  | {
      kind: "PermissionFailure";
      requiredRole: "Admin" | "Support";
    }
  | {
      kind: "DatabaseFailure";
      operation: "insertUser";
      retryable: boolean;
    };
 
type SaveUserResult =
  | {
      kind: "Saved";
      user: User;
    }
  | {
      kind: "Rejected";
      failure: SaveUserFailure;
    };
 
async function saveUser(user: NewUser): Promise<SaveUserResult> {
  const validation = validateUser(user);
 
  if (validation.kind === "Invalid") {
    return { kind: "Rejected", failure: validation.failure };
  }
 
  const savedUser = await db.users.save(validation.user);
 
  return { kind: "Saved", user: savedUser };
}

Now you get to know what happened. The caller switches on kind, the error carries the domain reason, and the compiler can tell you when you forgot to handle a case. If your hooks block the sloppy shape, the write bounces. The agent has to try again.

Sure, the worker might run into a few issues. It might waste a few tokens, but depending on where you run your agent, that might not matter very much. If the worker is local, you are paying with time instead of API spend. Across enough implementation work, that workflow can save serious money.

The same thing applies when you are using a high-level agent like Codex directly. The benefits get absurd. You can use the /goal loop and trust that the code will land somewhere inside the bounds you permit. You can review PRs with less paranoia when a good set of hooks is blocking non-idiomatic code, or whatever your project defines as non-idiomatic code.

This even lets you do something I do not recommend: write in a dynamically typed language and claw back some structure with hooks. In Lua, for example, a hook could block this sort of stringly typed state:

function charge_user(user, amount, mode)
  -- `mode` is free text, so every typo becomes a new unofficial state.
  if mode == "test" then
    -- `nil` means "test skipped" only if every caller memorizes the convention.
    return nil
  end
 
  if mode == "live" then
    return payment.charge(user.id, amount)
  end
 
  -- `false` is a second failure language. Very generous. Very cursed.
  return false
end

That is exactly the kind of thing you can refuse at the file boundary: magic string modes, nil as a domain result, false as a second failure shape. With enough of those checks, Lua can start feeling closer to TypeScript. Then TypeScript can start feeling closer to Haskell.

Force the agent toward this instead:

-- Named modes make the accepted inputs visible in one place.
local ChargeMode = {
  Test = "test",
  Live = "live",
}
 
-- One result tag gives callers a single field to switch on.
local ChargeResultKind = {
  Skipped = "skipped",
  Charged = "charged",
  Rejected = "rejected",
}
 
function charge_user(user, amount, mode)
  if mode == ChargeMode.Test then
    -- Test mode is a real result, not a secret `nil` convention.
    return { kind = ChargeResultKind.Skipped }
  end
 
  if mode == ChargeMode.Live then
    return {
      kind = ChargeResultKind.Charged,
      receipt = payment.charge(user.id, amount),
    }
  end
 
  -- Unknown mode is named as a rejection instead of hiding behind `false`.
  return {
    kind = ChargeResultKind.Rejected,
    reason = "unknown_charge_mode",
  }
end

Every branch returns the same outer shape, the result says what happened, and the caller can switch on kind instead of praying that nil means "test" and false means "something else went sideways, enjoy."

When the same bad shape keeps coming back, the hook can inject context right at failure time: "You used string modes. Use a discriminated union." The next attempt gets the relevant rule at the exact moment the model needs it.

In real projects, hooks reject shapes like these:

- Rust code reaching for `unwrap`, `expect`, `panic!`, or `todo!` at a domain boundary: rejected
- Rust APIs collapsing domain states into `bool`, `Option`, or string modes instead of an enum/newtype: rejected
- Haskell exports using partial functions like `head`, `tail`, `fromJust`, `!!`, or unguarded `read`: rejected
- Haskell pattern matches hiding future constructors behind a catch-all wildcard: rejected

Now imagine putting hooks around Haskell. It is already one of the strongest languages for encoding correctness into the type system. Hooks let you push those rules even further into the workflow.

Festina Lente

At Jakuta Inc., agentic development is something we really pride ourselves on. Our internal development saying is festina lente: make haste slowly. Oxymoronic, yes, but in a poetic way. These two concerns are just as important, and they have to coexist.

The point is to use AI for the speed while shoring up the gaps AI opens. Speed is only valuable if the codebase survives the acceleration.

What better language to write most of our backend in than Rust?

Here is a real example from our Rust crates, trimmed to the shape:

// One declaration owns the HTTP status and the response body.
declare_jakuta_response! {
    pub Ok200 ExampleHealthResponse {
        message: ExampleHealthMessageText,
    }
}

That macro owns the response shape. The older escape hatch was letting authors compose derives and attributes by hand. That gives humans and agents too many places to be clever. The macro surface turns it into one declaration: name the response, name the fields, and let the crate generate the repetitive contract code.

The macro owns the item, the derives, the schema traits, the wire traits, and the constructor shape. The author does not get to quietly smuggle a bad attribute through the side door.

Human and Agent Proofing

The response-header case is a better example of the actual bug cycle:

// Header-writing responses use a separate macro with required slots.
declare_jakuta_response_with_headers! {
    Ok200 MyLoginResponse {
        body = MyLoginBody,
        headers = MySetCookieDirective,
    }
}

At one point, header-writing responses could be expressed through magic field names in the general response macro. That worked until it didn't: the body/header contract was implicit. The fix was to split header-writing into its own macro, where a header response has to name its body and headers explicitly. The grammar stopped letting the mistake exist.

The macros did not appear out of nowhere. They were built on top of hooks and gates that stop AI from writing unperformant code, runtime panic paths, and escape-hatch Rust that technically compiles while violating the project contract. The workspace lints deny unwrap, expect, panic!, todo!, dbg!, lossy casts, undocumented unsafe blocks, wildcard enum match arms, and a pile of other nonsense. jakuta-contract check backs the shared surface with a banned-shape catalog, drift checks, and contract self-audits.

Those shared crates show up across our backend work, from telemetry products to consumer-facing products. Developers and agents alike are forced toward the pit of success when they write Rust against those surfaces.

Whenever we patch a bug, we ask how the architecture allowed the bug to be written:

The macro was not opinionated enough.
The hook was not opinionated enough.
The automatic PR hooks and gates were not consistent enough.

That is the response-header lesson in miniature: fix the bug, then remove the shape that allowed the bug.

So now let's think from first principles again. One of the first things I learned when I was learning programming was DRY: don't repeat yourself. I still think it is one of the most important rules in programming, probably the most important one.

Why? Because you do not want to fix the same bug ten times. The eleventh copy always exists somewhere, and now you have another runtime issue to debug.

We have extrapolated that idea as a company. The target is the whole class of footgun across the company. One file or one codebase at a time is too small. A crate updates, Cargo pulls the new version, and the bad shape disappears everywhere that depends on it.

That is why this macro matters to us, and why the pattern matters more than this one macro. It starts with picking the right language. Rust lets us build these kinds of macro surfaces, and that matters because Rust is useful across almost every backend we care about: hardware-adjacent systems, game services, financial services, company products, and personal projects. Rust gives us a correctness-oriented language we can bet on for most of the work we do.

Then the hooks empower the AI to work inside that world in the first place. The hooks push the agent toward the macro. The macro makes the right thing the only thing to do. That is a far cry from the usual loop:

Check for the wrong thing after the fact.
Deny the wrong thing after it already became a pattern.
Advise against the wrong thing and hope the agent cares.

Executable Contracts

And we are still not done. If you are starting to pick up what I am putting down, the next objection is obvious:

"Hypothetically, what if the agent decides not to use the macro?"

Rest assured, the AI agent will find ways so nuanced and magical to do everything besides the thing you wanted that you almost want to give it a round of applause.

Hence: assume the AI is almost a bad actor. It will try to do the lazy, bug-ridden thing. So the next thing I did was make hooks that detect Rust projects and force the Jakuta contract CLI into the workflow. This is automated. There is nothing for an agent or a human to remember.

# Fix compile-blocking contract violations first.
jakuta-contract check --severity breaking --summary
 
# Group remaining Rust / TypeScript / Dart contract issues.
jakuta-contract check --targets rs,ts,dart --summary --group-by category
 
# Run the full contract gate.
jakuta-contract check --targets rs,ts,dart
 
# Discover the blessed surface before hand-rolling one.
jakuta-contract surface "wire body"
jakuta-contract usage declare_jakuta_wire_body
jakuta-contract why ApiRequest
 
# Compare the consumer contract against the current shared surface.
jakuta-contract whats-new --with-check --targets rs,ts,dart
 
# Check the file the agent is about to write.
jakuta-contract preflight src/new_handler.rs

So now the agent is compiling against Rust's opinionated compiler, running our test suites, and checking that it obeys the Jakuta contract. The CLI can tell it where the blessed surface lives, why that surface exists, what changed in the shared contract, and whether the new code is bypassing a shared abstraction.

This is where the human starts getting out of the tedious loop. Instead of me asking, "Did you use the shared libraries properly? Did you bypass something the shared crates already own?" the agent can ask the contract CLI and get an answer before I ever see the PR.

Every crate's README markdown file is parsed into the CLI surface. The same idea applies here: make the wrong thing impossible and make the right thing the only thing. This principle does not only protect against drift. It lets us scale agents toward more autonomy, because the documentation, generated catalog, and CLI surface are forced to move together.

Reasoning About Correctness by Construction in the AI Age

Before any of this works, you need a shift in thinking. I understand a lot of this may not be feasible depending on the company you work for, the codebase you inherited, or the project you are stuck inside. Fine. Assuming you see the value in this kind of process, the prerequisites are:

Assume the agent will not do anything it is not forced to do.
Write code for agents and humans at the same time. Users touch the behavior. Developers maintain the system. Agents will be asked to edit it and will absolutely misuse whatever loose surface you leave lying around.

Once those two ideas are in your head, the inevitable bug becomes a different kind of event.

Say an agent bypasses the response macro and hand-rolls a response type. The usual reaction is to add a check right there, or factor the repeated code into a reusable helper. That helps, but it is still too low. Work backward from first principles: why was this bad shape possible? What part of the API should be removed? What should become explicit instead of implicit? How far up the stack can we push the fix so this entire class of bugs disappears?

That gives us the actual loop:

Find an issue.
Work backward from first principles and ask why that issue was possible in the first place.
Categorize it into the broader class of issues it belongs to.
Refactor as far up the stack as you can to eliminate that class.
Encode the rule in a hook so the agent cannot casually reintroduce it.
Then ask what would make the hook fire less often in the first place.

After iterating on the hook, watch how agents try to weasel around it. If the same hook fires constantly, that is a hook smell. Maybe the context injected by the hook is weak. Maybe the skill or markdown file needs sharper wording. Often, in my experience, repeated hook fires mean I patched the code smell instead of the architectural reason the smell exists.

A good hook teaches you where the design is still leaking. If it fires all day, the codebase is still inviting the wrong move.

The answer can land at different layers: hook, macro, crate, maybe even a language change.

AI has made these huge refactors newly realistic. In the past, a senior engineer or tech lead could reasonably say a large-scale correctness refactor was out of scope because it would take too long. Now the mechanical rewrite is no longer the blocker it used to be. You still need judgment and taste. You still need somebody who knows what the system is supposed to do.

Some of our style rules started as hooks. Perfect Rustacean and Perfect TypeScripter both follow the same idea: block the shape at write time, then move the rule deeper when the pattern keeps recurring. If agents keep trying a ternary-shaped collapse or a boolean-shaped state machine, the hook blocks it. If the pattern belongs in a macro, the macro eventually makes the hook quiet.

That is where the CLI comes in. The agent can check itself, discover the right surface, read the rationale, and try again without waiting for a human to notice the same mistake.

AI Is Not Your Oracle

If you are reading this article, you probably already know agents can do more than act like glorified Google or a desktop assistant. Give them the right tools, and they can do a lot more than autocomplete a function or summarize a file.

Agents also are not only about cutting company costs. In certain domains, they can raise costs: more compute, more tool-building, more review, more infrastructure around the agent. That does not make them useless. It means you need to understand what they are and what they are not. One last thing to keep in mind while you build with them: the agent is only as useful as the environment you force it to work inside.

This feels a lot like the creation of compilers. Long ago, people were writing close to the machine day in and day out. When IBM worked on FORTRAN, one of the first high-level language compilers, the same kinds of arguments showed up. Not literal transcripts, just the shape of the complaint:

"It is inefficient."

"It is only for novices."

"It will never scale."

"It is a crutch."

"Real programmers will get worse if higher-level tools do the work for them."

Spreadsheets got their version too:

"A machine could never understand a complex ledger, business nuance, or all the little exceptions hiding in the work."

I suspect some of the modern AI pushback is cope from older senior developers, and honestly, I sympathize with it. Imagine spending years memorizing syntax just for AI to memorize syntax better than you ever could.

But the moat to programming was never syntax. The moat has always been system design, reasoning about state, understanding the hardware, and knowing what goal you are actually trying to execute on.

Now, like it or not, for better or for worse, sink or swim, understanding agents is becoming part of the craft. The job is to steer them, constrain them, and lead them directly toward the goal.

AI is not the oracle. I do not believe you can be a programmer while strictly relying on AI. We heard a version of that story with SQL too. SQL was readable enough that people thought everyone would be able to program. Look around. That did not make software engineering disappear.

The useful posture is: assume the AI will miss things, then surround it with hooks, macros, CLIs, compiler errors, and tests that pull it back into shape.

As a reward for reading this far, here is the small view of how we do this on the TypeScript side. Perfect TypeScripter is one of the production plugins we use at Jakuta Inc. The public shape is enough to show the point:

ban boolean collapse when the domain has more than two outcomes
reject optional-field clusters that should be discriminated unions
require exhaustiveness on discriminant switches
block any, unsafe casts, and stringly typed state where the domain wanted a real type
install cross-file ESLint rules for duplicate envelope shapes, variant-literal drift, and phantom type parameters

Some of those are style notes. Some of them are hard correctness rules. All of them come from things we have seen agents and humans do in production code. Anyone can have an agent write a post about a nice idea. This is the stuff we actually make ourselves and our agents live under every day.

Again, we all write code in our own ways. We all have our own desires, styles, and priorities. This is just how I like to do things, and how I have seen other developers I work with enjoy doing things as well. It has cut down on a lot of the back-and-forth and ambiguity that show up in any big project. If you relate to the issues in this article, or in Correctness by Construction, I think it is at least worth trying.

People complain about AI slop, and yes, there is a lot of it. We have all seen what it looks like in code by now. But let's not pretend slop code was invented by agents. Humans were writing plenty of it before the model showed up.

Throwing our hands up and saying AI agents are unusable, do not scale, or are not here to stay serves no one. The better move is to look at what is actually happening: agents are fast, agents are wrong sometimes, and the useful question is how much of that wrongness you can make unshippable.

Build the rails carefully. Push them forward every time you catch a fresh class of mistake. Models change. Prompts drift. The rails are yours.

If you haven't yet, read Correctness by Construction for the ladder of techniques this piece assumes. The Dijkstra line from over there still applies: if programming is the process of putting bugs in, then the AI age just gave you a faster way to insert them, and equally fast machinery for refusing to let them stay.

Next in this series, still being written: Cognitive Complexity, Artificial Intelligence and Programming.

Related in this cluster

Correctness by Construction
The ladder of techniques underneath this whole argument.
Why I'll Never Shut Up About Rust
The personal path that made me care about correctness in the first place.
Perfect TypeScripter
The public TypeScript side of the agent gates I keep talking about.
Haskell and CorrectnessStill being written
Why Haskell keeps showing up whenever programmers talk seriously about types and invariants.
Avoiding Vendor Lock-In with Agent ToolingStill being written
How to build agent workflows around portable contracts instead of one vendor's chat box.

Related in this cluster

Essays, ideas, and project notes.