Borrowed from Distributed Systems: Coordination Patterns for AI-Assisted Teams

Skip Marshall · March 24, 2026 · 9 min read

If you've been reading along for the past two months, you might notice a pattern emerging.

The problems we've been describing (broken estimation, re-litigated decisions, semantic conflicts, the gap between speed and safety) are not really product problems. They're not even process problems in the traditional sense. They're coordination problems. And coordination problems have known solutions. They just don't come from agile frameworks or product management playbooks. They come from an older, harder domain: distributed systems engineering.

Here's the thing: your team with AI assistance is now functionally a distributed system.

You have multiple agents (humans and models) operating in parallel with partial information, making decisions asynchronously, sometimes conflicting, sometimes retrying, sometimes needing to reach consensus without a global clock. The familiar concepts from systems architecture (eventual consistency, consensus protocols, circuit breakers, shared state management) aren't analogies anymore. They're literal descriptions of what's happening.

I've watched this unfold over three years of engagements in which teams shipped faster and also shipped safer. And every single time, the teams that succeeded borrowed the same set of patterns that made distributed systems work at scale. Not all of them. Five specific patterns. Let me walk you through each, and then show you how they compose into a loop that actually manages the speed-safety tradeoff.

Pattern 1: Shared State (Your Context Graph)

In a distributed system, consensus is impossible without a canonical record. You need a single source of truth that every node can reference, even if they're working on stale copies locally.

In an AI-assisted team, that source of truth is almost never what it should be. It's scattered: an email thread, a Slack thread, a comment in a half-forgotten PR, someone's notion of what was decided three sprints ago.

When a human and an AI model disagree about what the product should do, where do you even go to find the truth? Nowhere. Which is exactly the problem.

Here's a real example from a fintech product we shipped last year. The team was building a transaction reconciliation feature. The designer had written one spec. The engineer had interpreted it differently. The AI model had been given yet another version as context. When edge cases started appearing in testing, nobody could agree on what the "right" behavior was because they were all reading from different sources. We lost two weeks to re-litigation of decisions that had already been made.

The fix: a single, structured, queryable record of the product's intended behavior. Every decision about what the product does, why, and under what constraints gets recorded in one place. Not Confluence. Not a wiki that nobody updates. A living, queryable graph that's the actual source of truth for product intent.

When the AI model asks "what should this transaction do if both the reference number and the amount don't match?" the team doesn't deliberate. They query the context. When a human reviewer flags a behavior as wrong, they check the context first. Is it a violation of the spec, or a misunderstanding of the spec? The answer is always available.

This isn't a documentation tool. It's a coordination tool. Every node in the system — human or AI — can synchronize to the same state without having to re-communicate about basics.
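
To make "query the context" concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `ContextGraph` and `IntentRecord` names, the key format, and the recorded behavior are hypothetical, and a real context graph would sit on a persistent, versioned store rather than a dictionary.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IntentRecord:
    """One recorded statement of intended product behavior (hypothetical schema)."""
    key: str          # stable identifier for the behavior being described
    behavior: str     # what the product should do
    rationale: str    # why it was decided that way
    depends_on: list[str] = field(default_factory=list)  # edges to related records


class ContextGraph:
    """In-memory stand-in for the shared, queryable source of truth."""

    def __init__(self) -> None:
        self._records: dict[str, IntentRecord] = {}

    def record(self, rec: IntentRecord) -> None:
        # The latest write for a key replaces the previous entry.
        self._records[rec.key] = rec

    def query(self, key: str) -> Optional[IntentRecord]:
        # None means "nothing has been decided yet", which is itself useful to know.
        return self._records.get(key)

    def related(self, key: str) -> list[IntentRecord]:
        # Follow the outgoing edges of one record, skipping dangling references.
        rec = self._records.get(key)
        if rec is None:
            return []
        return [self._records[k] for k in rec.depends_on if k in self._records]


graph = ContextGraph()
graph.record(IntentRecord(
    key="reconciliation.match",
    behavior="Placeholder: how a normal match is confirmed.",
    rationale="Placeholder rationale.",
))
graph.record(IntentRecord(
    key="reconciliation.mismatch.reference_and_amount",
    behavior="Return ranked candidate matches for review.",  # illustrative only
    rationale="Placeholder rationale, not the team's actual decision.",
    depends_on=["reconciliation.match"],
))

answer = graph.query("reconciliation.mismatch.reference_and_amount")
undecided = graph.query("reconciliation.currency_mismatch")  # never recorded
```

The point of the sketch is the shape of the interaction: a human or a model asks one question of one place, and gets either the recorded answer or an explicit "undecided."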

Pattern 2: Consensus Protocol (Your Decision Records)

Shared state solves "what's the current truth?" It doesn't solve "how did we get here, and will we stick with it?"

In a distributed system under failure scenarios (where systems degrade or provide inconsistent data), consensus protocols exist specifically to prevent conflicting decisions from propagating. Before Raft or Paxos, systems would diverge, contradict themselves, lose data.

An AI-assisted team has the same problem. A decision gets made in a standup. Two weeks later, when evidence suggests it was wrong, someone proposes revisiting it. But the original decision-maker is in a meeting, the context is fuzzy, and the AI model has already begun building downstream artifacts based on the original call. We litigate it all over again. Or worse — we make a new decision that contradicts the old one, and now the system is incoherent.

The fix: a decision record protocol. Not "write down all decisions" (that's documentation). A protocol where:

1. Decisions are proposed with an explicit proposer, timestamp, and rationale

2. They're agreed to by stakeholders in a recorded fashion (not in chat)

3. They're timestamped so causality is clear

4. They have explicit review points (in 2 weeks, in a sprint, when we have data)

5. They're mutable, but only through the same recorded process
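
The five properties above can be sketched as a small data structure. This is one possible shape, not a prescribed format: the `DecisionRecord` class, its field names, and the people and decisions in the usage lines are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class DecisionRecord:
    """A proposed decision plus its recorded agreement (hypothetical schema)."""
    proposer: str
    summary: str
    rationale: str
    proposed_at: datetime   # timestamp, so ordering between decisions is clear
    review_at: datetime     # explicit review point
    agreed_by: dict[str, datetime] = field(default_factory=dict)
    supersedes: Optional["DecisionRecord"] = None

    def agree(self, stakeholder: str, when: datetime) -> None:
        # Agreement is stored on the record itself, not left in chat.
        self.agreed_by[stakeholder] = when

    def is_agreed(self, required: set[str]) -> bool:
        # True only when every required stakeholder has signed off.
        return required.issubset(self.agreed_by)

    def due_for_review(self, now: datetime) -> bool:
        return now >= self.review_at

    def revise(self, proposer: str, summary: str, rationale: str,
               now: datetime, review_in: timedelta) -> "DecisionRecord":
        # A change is a new record that points at the old one, so the
        # lineage survives and the new version needs fresh agreement.
        return DecisionRecord(proposer, summary, rationale, now,
                              now + review_in, supersedes=self)


t0 = datetime(2026, 3, 2, 9, 0, tzinfo=timezone.utc)
original = DecisionRecord(
    proposer="dana",
    summary="Placeholder decision",
    rationale="Placeholder rationale.",
    proposed_at=t0,
    review_at=t0 + timedelta(weeks=2),
)
original.agree("design", t0 + timedelta(hours=3))
original.agree("engineering", t0 + timedelta(hours=20))

revised = original.revise(
    "lee", "Placeholder revision", "Placeholder: new evidence.",
    now=t0 + timedelta(weeks=2), review_in=timedelta(weeks=2),
)
```

Note the design choice in `revise`: the old record is never edited in place. That is what makes "mutable, but only through the same recorded process" enforceable rather than aspirational.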

I watched a logistics startup do this last quarter. They had a 15-person team. Decisions were being made in all-hands, Slack, 1-1s, code reviews. When bugs appeared, debugging involved reconstructing what was actually decided. They moved to a simple protocol: every decision on product behavior goes through a 24-hour review window. Anyone can propose. Anyone can comment. The proposer acknowledges feedback, updates the rationale, or withdraws. Then it's recorded with an agreement timestamp.

This sounds like overhead. In practice, it eliminated decision churn. Because now when someone challenges a decision later, the original proposer can say "here's why we chose this, here's who agreed, here's when we said we'd revisit." It's not personal. It's not political. It's just protocol.

And the AI model? It gets to see decision lineage. It doesn't just implement the current spec. It understands why the spec is the way it is, which turns out to be incredibly valuable context for handling edge cases.

Pattern 3: Service Contracts (Your Intent Contracts)

In a microservices architecture, you can't avoid the fact that Service A is going to call Service B, often through intermediaries, often with latency or partial failure. So you don't try to prevent that coupling. Instead, you make it explicit and bounded through service contracts.

The contract says: "Service A will send me a request with fields X, Y, Z, each validated against these constraints. I promise to return a response with fields A, B, C within 500 ms at p95. If I can't, I'll return HTTP 503. If you receive a 503, you should retry with exponential backoff, starting at 2 seconds."

That contract is the interface. It's what both sides agree to. It liberates the implementation of both services. Service A doesn't have to know or care how Service B works. It just has to honor the contract.
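
The caller's half of that contract (retry on 503 with exponential backoff, starting at 2 seconds) fits in a few lines. This is a sketch under assumptions: `call_with_backoff` is a hypothetical helper, `send` stands in for whatever actually makes the request, and the cap of four attempts is my choice, not part of the contract above.

```python
import time
from typing import Callable

Response = tuple[int, dict]  # (HTTP status code, parsed body)


def call_with_backoff(
    send: Callable[[], Response],
    max_attempts: int = 4,      # assumed cap; the contract does not specify one
    base_delay: float = 2.0,    # first wait, per the contract
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send(); on a 503, wait 2s, 4s, 8s, ... and try again."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status != 503:
            return status, body          # success or a non-retryable error
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status, body                  # still 503 after the final attempt


# Usage with a fake service that recovers on the third call.
responses = iter([(503, {}), (503, {}), (200, {"a": 1})])
waits: list[float] = []
result = call_with_backoff(lambda: next(responses), sleep=waits.append)
```

Passing `sleep` in as a parameter keeps the sketch testable without real delays. Production code would usually add jitter to the wait as well.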

In an AI-assisted team, work streams are dependencies. A human writes a functional spec. A model reads it and generates code. Another human reviews both. Each interaction is a place where intent can be lost, misunderstood, or contradicted. So you need explicit contracts.

An intent contract is a statement of what a piece of work is trying to achieve, in enough detail that a reviewer (human or AI) can evaluate whether it succeeded or failed, without having to infer intent. Not "build the reconciliation feature." But "given a transaction with amount and reference number mismatches, return a ranked list of possible matches, ordered by likelihood, with confidence scores, within 200ms, using the existing fuzzy-match library only."

One of our engagements was a healthcare compliance product. A feature was handed off from design to engineering with the spec "build a user preferences screen." That's not a contract. It's a title. Three weeks of rework later, they discovered the design had been thinking "preferences that auto-save" and engineering had built "preferences in a modal with an explicit submit button." Total misalignment, zero visibility until QA.

With intent contracts, the design team would have written: "provide a UI where users can toggle features on/off with zero-click feedback (state changes on toggle, persists to backend within 1 second, persisted state is guaranteed before user can navigate away)." Now the engineer can't build a modal with a submit button. The contract forbids it. And they know it before building.

This scales beautifully with AI. The model reads the intent contract. It can flag missing constraints before starting. It can test its own output against the contract. A human reviewer can check "did the work satisfy the contract?" instead of "do I like this implementation?"
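
Here is roughly what "test its own output against the contract" could look like for the reconciliation contract quoted earlier. The 200 ms budget and the ranked-by-likelihood rule come from that contract. The `Match` type, the function name, the assumption that confidence lives in [0, 1], and the assumption that "likelihood" and confidence are the same number are mine. The "existing fuzzy-match library only" clause can't be checked from output alone, so this sketch leaves it to code review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    candidate_id: str
    confidence: float  # assumed to be a probability-like score in [0, 1]


def check_reconciliation_contract(matches: list[Match], elapsed_ms: float) -> list[str]:
    """Return a list of contract violations; an empty list means the work passes."""
    violations = []
    if elapsed_ms > 200:
        violations.append(f"took {elapsed_ms:.0f} ms, contract allows 200 ms")
    if any(not 0.0 <= m.confidence <= 1.0 for m in matches):
        violations.append("confidence score outside [0, 1]")
    scores = [m.confidence for m in matches]
    if scores != sorted(scores, reverse=True):
        violations.append("matches are not ordered from most to least likely")
    return violations


ok = check_reconciliation_contract(
    [Match("txn-1", 0.9), Match("txn-2", 0.4)], elapsed_ms=120.0
)
broken = check_reconciliation_contract(
    [Match("txn-1", 0.4), Match("txn-2", 0.9)], elapsed_ms=250.0
)
```

A reviewer reading `broken` sees two named violations instead of a vague sense that something is off, which is the whole point of writing the contract down.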

Pattern 4: Circuit Breakers (Your Fortify)

A circuit breaker is a safety pattern in distributed systems. When a downstream service starts failing, a circuit breaker stops sending traffic to it, fails fast locally, and periodically probes to see if the service has recovered. This prevents cascading failures where one service's problems corrupt the entire system.
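
For readers who haven't built one, this is a minimal sketch of the classic pattern: closed while things work, open after repeated failures, and a single probe allowed once a timeout has passed. The class name, the threshold of three failures, and the 30-second timeout are illustrative assumptions, and the sketch is not thread-safe.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call the downstream service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast locally instead of hitting a sick service.
                raise CircuitOpenError("downstream presumed unhealthy")
            # Timeout elapsed: fall through and let one probe call go out.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Trip the breaker, or re-open it after a failed probe.
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


# Usage with a fake clock so the example runs instantly.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30.0, clock=lambda: now[0])
attempts = []


def flaky():
    attempts.append(1)
    raise ValueError("service down")


for _ in range(3):
    try:
        breaker.call(flaky)
    except ValueError:
        pass

try:
    breaker.call(flaky)       # fourth call never reaches the service
    rejected = False
except CircuitOpenError:
    rejected = True

now[0] = 31.0                 # timeout has passed, so a probe is allowed
recovered = breaker.call(lambda: "ok")
```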

In an AI-assisted team, the "downstream service" might be an AI model, a human reviewer, an external API. When something is failing or degraded, you need a safety gate that prevents the failure from cascading deeper into your product.

This is where I see teams struggle the most. Because they confuse "safety gate" with "approval process." An approval process slows everything down. A circuit breaker is the opposite: it isolates failures, prevents corruption, and lets good work flow through.

Here's a real scenario: your AI model is generating code. It's good 85% of the time, but occasionally generates something subtly wrong that passes unit tests and then breaks in production. Without a circuit breaker, that 15% failure rate cascades. Bad code reaches production. Users see bugs. Trust erodes.

A circuit breaker would say: "run this code through static analysis, load-testing, and edge-case fuzzing. If it fails any of those, stop. Human review required. If it passes all three, ship it." Now you've isolated the failure mode. The 15% that would have broken in production gets caught before it matters.
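
As a sketch, that gate is a function from "generated code plus a set of checks" to one of two outcomes. The `gate` function and the three lambda checks below are stand-ins I made up. Real static analysis, load testing, and fuzzing would be calls out to actual tools.

```python
from typing import Callable

Check = Callable[[str], bool]  # takes generated code, returns True if it passes


def gate(code: str, checks: dict[str, Check]) -> tuple[str, list[str]]:
    """Return ("ship", []) if every check passes, else ("human_review", failed_names)."""
    failed = []
    for name, check in checks.items():
        try:
            passed = check(code)
        except Exception:
            passed = False  # a crashing check counts as a failure, never a pass
        if not passed:
            failed.append(name)
    if failed:
        return "human_review", failed
    return "ship", []


# Placeholder checks, for illustration only.
checks = {
    "static_analysis": lambda code: "eval(" not in code,
    "load_test": lambda code: True,
    "edge_case_fuzzing": lambda code: len(code) > 0,
}

clean = gate("print('hello')", checks)
risky = gate("eval(user_input)", checks)
```

Notice what the gate does not do: it never asks a human to approve work that passed. People only get pulled in when a named check fails, and they are told which one.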

I worked with a marketplace platform where AI was generating price recommendations. The system had no circuit breaker. A recommendation algorithm degraded (not enough data, but still running), and for two hours, vendors were getting suggested prices 10x higher than they should be, orders were being deferred, the whole feed mechanism was corrupted. If they'd had a circuit breaker ("if p75 recommendation variance exceeds threshold, return null, fall back to human pricing"), two hours becomes two minutes and affects zero orders.

A circuit breaker isn't friction. It's the opposite. It's the thing that lets you ship faster, because you're not protecting against failure with more process. You're isolating it with a gate.

Pattern 5: Observability (Your Telemetry)

In distributed systems, you cannot manage what you cannot measure. A request flows through 15 services. One of them is slow. Which one? Without instrumentation, you're flying blind.

The same is true for AI-assisted teams. You're shipping faster. You're making decisions with partial information. You're using AI models that sometimes hallucinate. If you can't see what's happening (where decisions are being made, who's making them, where AI confidence is high vs. low, where humans are overriding models, where models are catching human mistakes), you're going to have a broken system and not know why.

Real telemetry means: every decision recorded. Every AI-assisted action logged with the input, the output, the confidence, and the human review outcome. Which kinds of features ship faster? Which kinds regress more often? Where are reviews happening and why? Where are decision records being revisited, and what triggered the revisit?

With good telemetry, you can answer: "our AI model is generating code reviews with an 89% accuracy rate on security issues, but only 67% on performance issues. Let's retrain on performance." Or "we're re-litigating pricing decisions 4x more often than any other decision type. Let's improve the decision protocol for pricing."
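
Numbers like those fall out of a very plain log, as long as every AI-assisted action is recorded alongside its human review outcome. Here's a sketch. The `ReviewEvent` schema and the sample events are invented, and I'm using "the human reviewer agreed" as a stand-in for accuracy, which is itself an assumption worth questioning on a real team.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewEvent:
    """One AI-assisted action and what the human reviewer concluded (hypothetical schema)."""
    category: str         # e.g. "security" or "performance"
    ai_confidence: float  # whatever confidence the model reported
    human_agreed: bool    # outcome of the human review


def agreement_rate_by_category(events: list[ReviewEvent]) -> dict[str, float]:
    """Fraction of events per category where the human agreed with the AI."""
    totals: dict[str, int] = defaultdict(int)
    agreed: dict[str, int] = defaultdict(int)
    for event in events:
        totals[event.category] += 1
        if event.human_agreed:
            agreed[event.category] += 1
    return {category: agreed[category] / totals[category] for category in totals}


# Made-up sample data, not measurements from any real team.
events = [
    ReviewEvent("security", 0.92, True),
    ReviewEvent("security", 0.88, True),
    ReviewEvent("security", 0.71, True),
    ReviewEvent("security", 0.65, False),
    ReviewEvent("performance", 0.80, True),
    ReviewEvent("performance", 0.77, False),
]
rates = agreement_rate_by_category(events)
```

Once events look like this, the same log answers the other questions too: group by decision type to see what gets re-litigated, or bucket by `ai_confidence` to see whether the model's confidence means anything.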

I've seen teams measure the wrong things. Deployment frequency (vanity metric). Time in code review (broken incentive). They end up optimizing for the metrics, not for actual delivery quality.

The right telemetry tells you: are we shipping safer? Are we reducing re-work? Are humans and AI models actually coordinating, or just pretending to?

The Loop

These five patterns don't live in isolation. They compose into a loop.

You build a shared state (context graph) that serves as the source of truth. When someone wants to change it, they propose through a consensus protocol (decision records). Those decisions establish explicit contracts (intent contracts) for the work that follows. Work gets executed under safety gates (circuit breakers) that catch failures before they cascade. And the entire system is instrumented with observability (telemetry) so you can see what's actually happening, learn from it, and improve the protocol itself.

The human doesn't manage the AI. The AI doesn't manage the human. They both synchronize to the shared state, make decisions through recorded protocols, execute under bounded contracts, fail safely, and measure continuously. The system manages itself.

This isn't more process. It's actually less process than traditional Agile, because you've removed the coordination overhead. Fewer meetings. Fewer Slack threads. Fewer surprises. Instead of "let's talk about this," you query the shared state. Instead of "did we agree on this?" you check the decision record. Instead of "I hope the engineer understood my intent," you have a contract. Instead of "oh no, the model broke something," you have a circuit breaker.

It's more systems thinking. Less organizational friction.

Why This Works

If you're building software with humans and AI, you're not in a startup problem space anymore. You're in a systems problem space.

The topology is different, but the coordination challenges are identical to what distributed systems engineers solved 15 years ago.

Borrowed patterns beat invented solutions.

Every team I've watched implement these patterns — shared state, consensus protocol, service contracts, circuit breakers, observability — has shipped faster and safer. Not because they're working harder. Because they've stopped re-coordinating. The system remembers. The system synchronizes. The system fails gracefully.

Next week, Skip's going to talk about why this all falls apart if your incentives are wrong. Spoiler: there's a $50k prototype trap that eats good patterns for breakfast.
