
This is the sixth and final article in "How We Actually Build," a series where Skip and I open up the CRAFT methodology one layer at a time. Three months in, the honest retrospective on running it.
By Skip Marshall & Chuck Griess
Last Wednesday, Chuck and I sat down to look at our four active CRAFT client engagements side by side. We had a spreadsheet open. The columns were the five CRAFT components. The rows were the projects. Each cell was a one-to-five score for how cleanly that component was running on that engagement.
We were not expecting the picture we got.
Two engagements were running the system end-to-end and getting the outcomes we'd hoped for: tighter shipping cadence, fewer re-litigated decisions, faster client onboarding when new people joined the team. One engagement was running three of the five components well and the other two barely at all, and was producing surprisingly good outcomes anyway. The fourth was running all five components by the book and producing outcomes that were, frankly, mediocre.
The system is not a switch. It is a panel of dials, and the relationship between "how much CRAFT is running" and "how well the project is going" is not a straight line.
That is what I want to write about today. Not the polished version of "here is the methodology working." The actual version of "here is what three months of running CRAFT looks like, including the parts we did not predict."
If you read our Field Notes piece from Series 1, about the four-person team that replaced a twelve-person org, you saw a version of this honesty before. Field Notes was a snapshot of one engagement on a Tuesday. This is the same posture, applied to four engagements over a quarter.
When Skip pulled me into that retrospective, I came in expecting the data to make a clean case for the methodology. It did not.
What the data showed, instead, was three patterns.
The first pattern: the components do not contribute equally. Across our four engagements, two CRAFT components correlate strongly with the outcomes we care about. The Intent Contract correlates with everything. Projects that have a tight Intent Contract at the start ship cleaner, even when other components are weaker. And the Fortify Gate correlates with the absence of production incidents, which is the thing it was designed to do, so that one is less surprising.
The Context Graph and the Decision Record both contribute, but they contribute in compounding rather than immediate ways. A team running these for six weeks looks similar to a team not running them. A team running these for three months looks markedly different. The benefit is real but it is not visible in a sprint cycle. That has implications for how we sell the methodology, and for how patient we have to ask clients to be.
The second pattern: the system tolerates partial adoption better than we expected. The engagement running three of five components well is doing well because those three components happen to be the ones that matter most for that particular project. They are doing a discovery-heavy build where the Intent Contract is everything, they are paranoid about production stability so the Fortify Gate is locked down, and they are heavily AI-leveraged so the Context Graph is non-negotiable. The Decision Record discipline is loose and the boundary calibration is informal, and that is fine for their stage and risk profile.
The third pattern, and this is the one I did not see coming: the engagement running everything by the book and producing mediocre outcomes is being held back not by the methodology, but by a team-level dynamic that CRAFT did not surface. Specifically, the team has the artifacts but does not have the conversations the artifacts are supposed to force. They write Intent Contracts that nobody pushes back on. They capture Decision Records that nobody questions. They maintain a Context Graph that nobody updates with the controversial stuff. The form is there. The function is not.
That is the failure mode I want to flag most explicitly, because it is the one a team adopting CRAFT is most at risk of stumbling into.
The client-facing version of this looks different from the internal data, and it is worth telling that side of it too.
Three things show up in client conversations consistently now, three months in.
The first is the shift from vibe-based delivery commitments to evidenced ones. "When does this ship" used to be answered with a confidence-weighted guess. It is now answered with "it ships when it passes the Fortify Gate, and based on what is open, that looks like Tuesday." Founders and product leads notice this shift faster than I expected. The thing they say, almost word for word, is some version of "I trust your timelines more now."
The second is that the Intent Contract is doing a job we did not initially design it for. We built it to force clarity at project start. What it is also doing is creating an artifact the client can hold us accountable to mid-project, which keeps both sides honest about scope creep without it turning into a negotiation. When a stakeholder asks for something that is not in the Intent Contract, the conversation shifts from "yes or no" to "do we want to amend the contract, and what are we trading for it." That is a much better conversation.
The third is the one I find most interesting. Two of our four clients have started asking whether they can run pieces of CRAFT on their internal teams, separate from our engagement. They want to use the Intent Contract format with their own product managers. They want to adopt the Fortify Gate for their internal releases, not just ours. That is a signal we did not expect this early. It is not, by itself, validation that the methodology works. But it is a signal that practitioners on the client side see enough value to want to take it home.
In the spirit of building in public: here is the list of things I do not have clean answers to yet.
The first is the maintenance burden. The Context Graph and the Decision Record are both living artifacts, and the cost of keeping them current is real. On the engagements where one engineer has taken clear ownership, currency stays high. On the engagements where ownership is distributed, currency drifts. We do not yet have a model for how to make this work on larger teams, because all four of our engagements are still small. I do not know what CRAFT looks like on a team of fifteen, and I want to be honest that I do not know.
The second is the calibration of the boundaries layer. The Context Graph has a section that defines where AI autonomy is high, medium, and low. Setting those boundaries correctly requires a level of architectural judgment that varies sharply by team. The seasoned engineers do it well intuitively. The teams that need the most help with calibration are the same teams that have the least experience to draw on. We have not solved that. The current workaround is that we do the initial calibration with the client and they refine it over time, but that does not scale to teams we are not directly working with.
The third is the part of the system that is supposed to evolve. CRAFT itself was designed to be revised. We have not revised it yet. Some of that is because three months is not enough time to know what wants revising. Some of it is because we built it with intent and we are reluctant to touch it without strong evidence. I am watching myself to make sure that reluctance does not curdle into the same "stop re-litigating decisions" rigidity we wrote about in the original CRAFT piece, applied to the methodology itself.
If we were sitting across from a founder or a head of engineering who said "we want to adopt this," here is what I would actually say.
Do not adopt all five components on day one. Pick the one or two that map to your sharpest current pain. If you are losing time to scope creep and re-litigated decisions, start with the Intent Contract and the Decision Record. If you are losing time to production incidents and unreliable rollbacks, start with the Fortify Gate. If you are leveraging AI heavily and getting inconsistent output, start with the Context Graph. Get that starting set running cleanly for six weeks before adding the next component.
Watch for the failure mode Chuck flagged. The artifacts are not the methodology. The conversations the artifacts force are the methodology. If you find yourself writing an Intent Contract that nobody disagrees with, you have not yet started running CRAFT. You are just generating documents. The signal that you are running the system is friction, productive friction, on the things that used to slip through.
Pick an owner per component. Not a team. A name. The Decision Record owner. The Fortify Gate runner. The Context Graph maintainer. Diffuse ownership is the failure mode that ate two of our four engagements until we corrected it. One person per artifact, with the authority to push back on the team when the artifact is being neglected.
Be patient with the compounding components. The Decision Record and the Context Graph will not feel valuable for the first four to six weeks. They start feeling valuable around week eight, when you reach back into the record to settle a debate and realize the debate would have cost you a week if the record had not been there. If you measure these components in their first month, you will quit them. Do not measure them in their first month.
And finally: do not treat CRAFT as a thing to install. Treat it as a thing to grow into. "Borrowed from Distributed Systems," the piece we wrote in Series 1, made the case that small teams in the AI era look more like distributed systems than like the small teams of the previous decade. CRAFT is the operating model for that system. You install an operating system once. You grow into an operating model over months.
A few people have asked us, in the last few weeks, whether we are going to package CRAFT into a product. The honest answer is: not yet, and probably not the way you would expect.
We are going to keep running it. We are going to keep writing about it. We are going to keep adding clients to the cohort that uses it, and we are going to keep being public about what works and what does not. At some point there will be a version of it that is mature enough to be a product, but if we ship that version before we have run it on enough engagements at enough team sizes, we will end up selling the polished form to people who need the actual function. We have watched too many methodologies make that exact mistake.
For now, the artifact we are leaving you with at the end of this series is not a tool. It is a self-assessment. The CRAFT Maturity Model. It is the same lens Skip and I used last Wednesday when we sat down with that spreadsheet. You can use it on your own team, in about thirty minutes, to figure out where you actually are on each CRAFT component, and where the highest-leverage next step lives.
For each of the five CRAFT components below, score your team from 1 to 5 using the rubric. The point of the assessment is not to score high. It is to surface your lowest-scoring component, so you can pick a single, focused next step.
Tuned for a three-to-five-person team. Larger teams can use it but should expect the maintenance discipline thresholds to be more demanding.
Total below 10: You are pre-CRAFT. Pick one component to start with based on your sharpest current pain, and run it for six weeks before adding anything else.
Total 10–15: You are running the system partially. Find the component with the lowest score and ask whether that gap is intentional, given your risk profile, or accidental. If accidental, that is your next focus.
Total 16–20: The system is running. Your next leverage is probably in calibration: tightening the Intent Contract, refining the Fortify Gate, getting the Context Graph closer to currency. The components are in place. The function is improving.
Total 21–25: You are running the full system. Use the assessment quarterly to catch drift. The biggest risk at this maturity is complacency, where the artifacts stay current but the conversations they were designed to force start to soften.
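If you would rather run the tally in code than in a spreadsheet, here is a minimal sketch of the bands above. It is an illustration under assumptions, not an official CRAFT tool: the function name `assess` and the short band labels are ours for this example, and the fifth key is a placeholder you should rename to match your own rubric.

```python
# Sketch of the scoring bands described above. The band labels paraphrase the
# article; the function name and the fifth component's key are hypothetical.

def assess(scores: dict[str, int]) -> dict:
    """Total five 1-5 component scores and map the total to a maturity band."""
    if len(scores) != 5:
        raise ValueError("expected exactly five component scores")
    for name, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{name}: score must be between 1 and 5")

    total = sum(scores.values())
    if total < 10:
        band = "pre-CRAFT"
    elif total <= 15:
        band = "running partially"
    elif total <= 20:
        band = "system is running"
    else:
        band = "running the full system"

    # Report every component tied for the lowest score, not just the first.
    lowest = min(scores.values())
    weakest = [name for name, score in scores.items() if score == lowest]
    return {"total": total, "band": band, "weakest": weakest}


example = {
    "Intent Contract": 4,
    "Context Graph": 4,
    "Decision Record": 2,
    "Fortify Gate": 5,
    "fifth component (placeholder)": 2,
}
result = assess(example)
print(result)
```

The sketch returns every component tied for the lowest score because the next question is a human one: is each of those gaps intentional, given your risk profile, or accidental?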
The scores are less important than the conversation the scores force. If two people on the same team score the same component differently, that gap is the most useful data point in the exercise. It means the artifact exists, but the team does not yet share one view of how well it is working. That gap is what you fix next.
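One way to find those gaps, if each person scores independently, is to compare the sheets component by component. The sketch below is only an assumption about how you might do that: `score_gaps`, the rater labels, and the sample numbers are hypothetical, and it assumes every rater scored the same set of components.

```python
# Sketch: find the components where teammates' scores disagree the most.
# Rater labels and sample scores are made up for illustration.

def score_gaps(ratings: dict[str, dict[str, int]]) -> list[tuple[str, int]]:
    """Return (component, spread) pairs with a nonzero spread, widest first."""
    components = list(next(iter(ratings.values())).keys())
    gaps = []
    for component in components:
        values = [sheet[component] for sheet in ratings.values()]
        spread = max(values) - min(values)
        if spread > 0:
            gaps.append((component, spread))
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)


ratings = {
    "rater_a": {"Intent Contract": 4, "Context Graph": 3,
                "Decision Record": 2, "Fortify Gate": 5},
    "rater_b": {"Intent Contract": 4, "Context Graph": 1,
                "Decision Record": 3, "Fortify Gate": 5},
}
gaps = score_gaps(ratings)
print(gaps)
```

The component at the top of that list is where the conversation should start.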