Behavioral contract part III: the pipeline behind the code

The Pipeline Behind the Code
Most conversations about AI-assisted coding focus on the moment Claude writes a function. The prompt, the output, the quality of the generated code. That moment matters, but it's maybe 20% of the actual work. The other 80% is everything around it: understanding what to build, deciding how to structure it, verifying it works, making sure it doesn't break what already worked, and getting it into production.
I spent the last few months building a pipeline that handles that 80%. Not manually, not by remembering to follow steps, but as a system that enforces itself. This is the third article in a series about that framework. The first covered behavioral rules, the second covered adaptive complexity routing. This one walks through the full pipeline on a real task, from the moment I describe what I want to the moment the code is merged.
The Setup
The task: add a /preferences API endpoint to a side project. Users can store and retrieve their settings. It touches the database schema, the API layer, validation, and tests. Adaptive complexity puts this at L3, maybe L4 depending on how the schema migration interacts with existing data.
The pipeline has six phases. Each one is handled by a separate skill that activates in sequence. I don't invoke them manually. The framework routes to them based on the complexity assessment and the current phase of work.
Phase 1: Brainstorm
This is the phase most people skip, and it's the one that saves the most time.
When a task lands at L3 or above, the brainstorming skill activates before any code gets written. It doesn't generate ideas. It asks questions. The approach is Socratic: Claude interrogates the request until the requirements are concrete enough to plan against.
For the preferences endpoint, the conversation went something like this:
Claude: What data types do preferences store? Key-value strings, or structured objects?
Me: Key-value, but values can be strings, numbers, or booleans.
Claude: Should preferences be scoped per-user only, or also per-workspace?
Me: Per-user for now. Workspace comes later.
Claude: What happens when a user requests a preference that doesn't exist? 404, or default value?
Me: Default value. Each preference key has a default defined server-side.
Three questions. Maybe ninety seconds of my time. But without them, Claude would have made assumptions about all three points, and at least one of those assumptions would have been wrong. The workspace scoping question alone could have led to a schema design that's either over-engineered (multi-tenant from day one) or under-engineered (no path to add it later). Asking first costs almost nothing. Fixing a wrong assumption after implementation costs a lot.
The brainstorm phase ends when Claude has enough context to write a plan. It doesn't drag on. For an L3 task, three to five questions is typical. For an L4, maybe eight to ten. The skill is tuned to stop when additional questions would produce diminishing returns.
Phase 2: Plan
The planning skill takes the brainstorm output and produces a granular task list. Each task is two to five minutes of work, has a specific file path, and has a clear deliverable. This granularity isn't arbitrary. It's what makes the next phase (execution with subagents) possible.
The plan for the preferences endpoint:
1. Add Prisma schema for UserPreference model
   → prisma/schema.prisma
   → Fields: id, userId, key, value (JSON), createdAt, updatedAt
   → Unique constraint on (userId, key)
2. Generate and run migration
   → prisma/migrations/
3. Create preference defaults registry
   → src/config/preference-defaults.ts
   → Type-safe map of key → default value
4. Implement preference service
   → src/services/preferences.ts
   → get(userId, key), set(userId, key, value), getAll(userId)
   → Merge stored values with defaults for getAll
5. Add validation schemas
   → src/validation/preferences.ts
   → Zod schemas for request bodies
6. Create API routes
   → src/api/preferences.ts
   → GET /preferences, GET /preferences/:key, PUT /preferences/:key
7. Write tests
   → src/tests/preferences.test.ts
   → Service layer tests + API integration tests
Seven tasks. Each one is small enough that a subagent can handle it independently, and specific enough that there's no ambiguity about what "done" looks like. The file paths matter because they prevent the subagent from creating files in unexpected locations or splitting logic across files that shouldn't exist.
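To make tasks 3 and 4 concrete, here is a minimal sketch of a defaults registry and the merge that getAll performs. The preference keys (theme, pageSize, emailNotifications) and the function names are hypothetical, and the stored values are passed in as a plain object standing in for rows from the UserPreference table. It is an illustration of the idea, not the code the subagents produced.

```typescript
// Allowed value types, per the brainstorm: strings, numbers, or booleans.
type PreferenceValue = string | number | boolean;

// Hypothetical registry: every valid key and its server-side default.
interface PreferenceDefaults {
  theme: string;
  pageSize: number;
  emailNotifications: boolean;
}

const PREFERENCE_DEFAULTS: PreferenceDefaults = {
  theme: "light",
  pageSize: 25,
  emailNotifications: true,
};

type PreferenceKey = keyof PreferenceDefaults;

// Narrow an arbitrary string to a known preference key.
function isPreferenceKey(key: string): key is PreferenceKey {
  return Object.prototype.hasOwnProperty.call(PREFERENCE_DEFAULTS, key);
}

// Return the stored value if present, otherwise the registry default.
// Unknown keys are rejected rather than silently returning undefined.
function getPreference(
  stored: Record<string, PreferenceValue>,
  key: string
): PreferenceValue {
  if (!isPreferenceKey(key)) {
    throw new Error(`Unknown preference key: ${key}`);
  }
  const value = stored[key];
  return value !== undefined ? value : PREFERENCE_DEFAULTS[key];
}

// Start from the defaults and overlay stored values for known keys only.
function getAllPreferences(
  stored: Record<string, PreferenceValue>
): Record<string, PreferenceValue> {
  const merged: Record<string, PreferenceValue> = { ...PREFERENCE_DEFAULTS };
  for (const key of Object.keys(stored)) {
    const value = stored[key];
    if (isPreferenceKey(key) && value !== undefined) {
      merged[key] = value;
    }
  }
  return merged;
}
```

Keeping the defaults in one typed object means the service, the validation layer, and the tests can all agree on which keys exist without reading each other's code.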
At L3, this plan is shown to me but implicit approval is enough. I scan it, it looks right, I don't object, Claude proceeds. At L4, I'd need to explicitly confirm before anything moves forward.
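One way to picture the plan format and the approval rule is as a small data structure with two checks. This is a sketch under assumptions: the PlanTask shape, the field names, and the UserResponse values are invented for illustration, and only the L3 and L4 behavior described above is modeled.

```typescript
// Hypothetical shape of one plan entry: a path, a deliverable, a time budget.
interface PlanTask {
  id: number;
  title: string;
  filePath: string;
  deliverable: string;
  estimatedMinutes: number;
}

// Flag tasks that are too vague or too large to hand to a subagent.
function validateTask(task: PlanTask): string[] {
  const problems: string[] = [];
  if (task.filePath.trim() === "") {
    problems.push(`Task ${task.id}: missing file path`);
  }
  if (task.deliverable.trim() === "") {
    problems.push(`Task ${task.id}: missing deliverable`);
  }
  if (task.estimatedMinutes < 2 || task.estimatedMinutes > 5) {
    problems.push(`Task ${task.id}: expected 2 to 5 minutes of work`);
  }
  return problems;
}

// How the user reacted to the plan (names are assumptions).
type UserResponse = "no-objection" | "explicit-approval" | "objection";

// L3 proceeds unless the user objects; L4 requires explicit confirmation.
function canProceed(level: 3 | 4, response: UserResponse): boolean {
  if (response === "objection") return false;
  if (level === 4) return response === "explicit-approval";
  return true;
}
```

For example, canProceed(3, "no-objection") is true, while canProceed(4, "no-objection") is false until the plan is explicitly confirmed.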
Phase 3: Execute
This is where it gets interesting. The execution skill dispatches tasks to subagents, each working in isolation on their assigned piece. The main agent orchestrates, the subagents implement.
For seven tasks, the dispatch looked like this:
Batch 1 (parallel): Tasks 1-2 (schema + migration) and Task 3 (defaults registry). These have no dependencies on each other.
Batch 2 (parallel): Task 4 (service) and Task 5 (validation). Both depend on the schema from Batch 1, but not on each other.
Batch 3 (sequential): Task 6 (routes). Depends on the service and validation from Batch 2.
Batch 4 (sequential): Task 7 (tests). Depends on everything.
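The batching above falls out of the dependency graph. A minimal sketch, assuming each unit of work declares what it depends on: repeatedly take every unit whose dependencies are already done and run those together. The unit names and the planBatches function are hypothetical, and tasks 1 and 2 are treated as a single unit because they were dispatched together.

```typescript
// One dispatchable unit of work and the units it must wait for.
interface WorkUnit {
  name: string;
  dependsOn: string[];
}

// Group units into batches: everything in a batch can run in parallel,
// and each batch only starts after the previous one has finished.
function planBatches(units: WorkUnit[]): string[][] {
  const done = new Set<string>();
  const batches: string[][] = [];
  let remaining = units.slice();

  while (remaining.length > 0) {
    const ready = remaining.filter((u) => u.dependsOn.every((d) => done.has(d)));
    if (ready.length === 0) {
      throw new Error("Dependency cycle or unknown dependency");
    }
    batches.push(ready.map((u) => u.name));
    ready.forEach((u) => done.add(u.name));
    remaining = remaining.filter((u) => !done.has(u.name));
  }
  return batches;
}

const units: WorkUnit[] = [
  { name: "schema+migration", dependsOn: [] },
  { name: "defaults", dependsOn: [] },
  { name: "service", dependsOn: ["schema+migration"] },
  { name: "validation", dependsOn: ["schema+migration"] },
  { name: "routes", dependsOn: ["service", "validation"] },
  {
    name: "tests",
    dependsOn: ["schema+migration", "defaults", "service", "validation", "routes"],
  },
];

console.log(planBatches(units));
// → [["schema+migration", "defaults"], ["service", "validation"], ["routes"], ["tests"]]
```

A batch with a single entry is what the list above calls sequential: nothing else is ready to run alongside it.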
Each subagent gets a focused prompt: the task description, the file path, the relevant context from the plan, and the behavioral rules from the CLAUDE.md. It doesn't see the entire plan. It doesn't know about the other subagents. It just builds its piece.
When a subagent finishes, its output goes through a two-stage review. First, the subagent reviews its own work (does it compile, does it match the task spec, did it stay within scope). Then the main agent reviews the integration (does this piece fit with what the other subagents produced, are the imports correct, is the naming consistent).
The scope enforcement matters here. Without it, a subagent tasked with "implement the preference service" might also create the API routes, add a caching layer, or refactor the existing user service because it noticed something it could improve. The plan says "src/services/preferences.ts" and the subagent stays in that file.
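A scope check like this can be very small. The sketch below assumes the orchestrator knows which paths a task was allowed to touch and can list the files the subagent actually changed (for example, from a diff). The function names are hypothetical and this is not the framework's actual mechanism.

```typescript
// Normalize separators and a leading "./" so equivalent paths compare equal.
function normalizePath(p: string): string {
  return p.replace(/\\/g, "/").replace(/^\.\//, "");
}

// Return every changed file that the task was not allowed to touch.
function outOfScopeFiles(allowedPaths: string[], changedFiles: string[]): string[] {
  const allowed = new Set(allowedPaths.map(normalizePath));
  return changedFiles.map(normalizePath).filter((file) => !allowed.has(file));
}

const violations = outOfScopeFiles(
  ["src/services/preferences.ts"],
  ["./src/services/preferences.ts", "src/api/preferences.ts"]
);
console.log(violations);
// → ["src/api/preferences.ts"]
```

An empty result means the subagent stayed in its lane. Anything else is a reason to reject the output before the integration review.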
Phase 4: Test
TDD in the framework follows strict red-green-refactor. Write a failing test first, then write the implementation that makes it pass, then clean up.
In practice with subagents, the sequence is slightly different. The implementation subagents write code without tests (Tasks 1-6), and then a dedicated testing pass (Task 7) writes tests against the existing implementation. Not pure TDD. But the testing skill enforces something equally valuable: the tests must be written against the contract (the plan), not against the implementation. The test file gets the plan description, not the source code. This catches cases where the implementation drifted from the spec.
For the preferences endpoint, the test suite covered:
// Service layer
describe('PreferenceService', () => {
  it('returns default value when no preference is stored')
  it('returns stored value when preference exists')
  it('stores a new preference')
  it('updates an existing preference')
  it('returns all preferences merged with defaults')
  it('rejects invalid preference keys')
})

// API layer
describe('GET /preferences', () => {
  it('returns 401 without auth token')
  it('returns all preferences for authenticated user')
})

describe('PUT /preferences/:key', () => {
  it('validates request body against schema')
  it('stores the preference and returns updated value')
})
Each test runs in isolation. The test file was written by a subagent that had access to the plan and the defaults registry (to know which keys are valid) but didn't read the service implementation. If the tests pass, the contract is honored. If they fail, either the tests or the implementation need fixing, and the plan is the tiebreaker.
Phase 5: Review
Before anything gets pushed, the review skill runs a pre-review checklist. Claude reviews its own work against a set of criteria:
- Type checking passes (tsc --noEmit)
- All tests pass
- No console errors in the test output
- No TODO items left in the code (unless explicitly deferred)
- Git diff is clean (no untracked files that should be committed, no staged files that shouldn't be)
- The implementation matches the plan (each task in the plan has a corresponding change in the diff)
This catches the kind of thing that slips through when you're focused on whether the code works: a type annotation that's any instead of the actual type, a test that passes but doesn't assert anything meaningful, a file that was created during debugging and never cleaned up.
The checklist isn't optional. The framework runs it automatically before signaling that the work is ready for push.
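The last checklist item, that each planned task has a corresponding change in the diff, is the easiest one to mechanize. Here is a sketch under assumptions: the CheckResult shape and both function names are hypothetical, a planned path ending in "/" is treated as a directory prefix, and the other checks (type checking, tests) would feed their own results into the same summary.

```typescript
// Outcome of one checklist item.
interface CheckResult {
  name: string;
  passed: boolean;
  detail: string;
}

// Verify that every path named in the plan shows up in the changed files.
// A planned path ending in "/" matches any changed file under that directory.
function planCoverageCheck(plannedPaths: string[], changedFiles: string[]): CheckResult {
  const missing = plannedPaths.filter(
    (planned) =>
      !changedFiles.some((file) =>
        planned.endsWith("/") ? file.startsWith(planned) : file === planned
      )
  );
  return {
    name: "implementation matches plan",
    passed: missing.length === 0,
    detail:
      missing.length === 0
        ? "every planned path appears in the diff"
        : `no change found for: ${missing.join(", ")}`,
  };
}

// Collect all results so the report lists every failure, not just the first.
function summarize(results: CheckResult[]): { ready: boolean; failures: CheckResult[] } {
  const failures = results.filter((r) => !r.passed);
  return { ready: failures.length === 0, failures };
}
```

Running every check before reporting is a deliberate choice in this sketch: one pass that lists all failures is cheaper than fixing them one push attempt at a time.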
Phase 6: Push With Gates
The last phase is where the quality gates earn their keep. When Claude runs git push, a pre-push hook intercepts the command and runs a stack-aware check sequence.
For a TypeScript project, that's:
Secrets scan. Regex patterns for AWS keys, JWT tokens, GitHub tokens, private keys, hardcoded passwords. If anything matches, the push is blocked with the exact file and line number.
Type checking. tsc --noEmit. If there are type errors, the push is blocked.
Linting. Biome or ESLint, whichever the project uses. Warnings are reported, errors block.
Tests. Vitest or Jest. If any test fails, the push is blocked.
The whole sequence runs in about fifteen seconds for a project this size. If everything passes, the push goes through. If anything fails, Claude gets a clear error message with enough context to fix the issue without guessing.
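To give a feel for the secrets scan, here is a minimal sketch of a line-by-line regex scanner that reports the file and line number of each match. The patterns are simplified illustrations of common token formats, not the hook's actual rules, and the names (SECRET_RULES, scanForSecrets) are hypothetical. A real scanner would need broader patterns and some handling for false positives.

```typescript
// One match: where it was found and which rule fired.
interface SecretFinding {
  file: string;
  line: number;
  rule: string;
}

// Simplified, illustrative patterns. None use the global flag, so
// RegExp.test carries no state between calls.
const SECRET_RULES: { rule: string; pattern: RegExp }[] = [
  { rule: "AWS access key ID", pattern: /\bAKIA[0-9A-Z]{16}\b/ },
  { rule: "GitHub token", pattern: /\bgh[pousr]_[A-Za-z0-9]{36,}\b/ },
  { rule: "private key", pattern: /-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----/ },
  { rule: "JWT", pattern: /\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/ },
  { rule: "hardcoded password", pattern: /password\s*[:=]\s*["'][^"']+["']/i },
];

// Scan one file's content and return every rule hit with a 1-based line number.
function scanForSecrets(file: string, content: string): SecretFinding[] {
  const findings: SecretFinding[] = [];
  content.split(/\r?\n/).forEach((text, index) => {
    for (const { rule, pattern } of SECRET_RULES) {
      if (pattern.test(text)) {
        findings.push({ file, line: index + 1, rule });
      }
    }
  });
  return findings;
}
```

A non-empty result from any file in the push would block it, and the file and line fields are what let the error message point at the exact spot.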
There's also an MCP push guard. Claude Code can push files directly through the GitHub MCP server, bypassing git entirely. The guard intercepts these MCP calls and blocks them, redirecting Claude to use git push so the quality gates actually run. Without this, the entire gate system has a backdoor.
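The guard's core decision is a lookup on the tool name. The sketch below assumes MCP tools are identified by names like mcp__github__push_files and mcp__github__create_or_update_file. Those names, the GuardDecision shape, and the guardToolCall function are assumptions for illustration, and how the decision is wired into the agent's tool-call flow is left out.

```typescript
// Tool names assumed for illustration: MCP calls that write to a remote
// repository without going through git.
const BLOCKED_MCP_TOOLS = new Set<string>([
  "mcp__github__push_files",
  "mcp__github__create_or_update_file",
]);

interface GuardDecision {
  allow: boolean;
  reason?: string;
}

// Block direct MCP pushes and tell the agent what to do instead.
function guardToolCall(toolName: string): GuardDecision {
  if (BLOCKED_MCP_TOOLS.has(toolName)) {
    return {
      allow: false,
      reason: "Direct MCP pushes skip the pre-push gates. Use git push instead.",
    };
  }
  return { allow: true };
}
```

The reason string matters as much as the block itself: it redirects Claude to the path where the gates run, so it does not have to guess why the call failed.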
The Feedback Loop
After the feature is merged, the retrospective skill proposes capturing learnings. For this task, it flagged two things worth remembering:
The Zod validation schema needed .passthrough() to handle unknown keys gracefully. Without it, extra fields in the request body caused silent 400 errors. That's a gotcha worth noting for next time.
The Prisma JSON field type doesn't validate structure at the database level. Validation lives entirely in the application layer. Good to remember when debugging data issues later.
Both got saved to project memory. Next time Claude works on this codebase, those gotchas are part of its context. It won't rediscover them the hard way.
What This Actually Costs
The pipeline adds overhead. The brainstorm took ninety seconds. The plan took maybe a minute to generate and thirty seconds to review. The subagent dispatch and execution took longer than a single-agent implementation would have. The review checklist added another thirty seconds.
For an L1 task, that overhead would be absurd. For an L3 task that touches seven files and involves a schema migration, it's a bargain. The brainstorm caught a scoping question that would have cost thirty minutes to fix after the fact. The plan prevented file sprawl. The tests caught a validation edge case. The pre-push gate caught a type error. Each of these would have been a debugging session without the pipeline.
The adaptive complexity layer from the previous article is what makes this sustainable. The pipeline only activates fully on tasks that warrant it. Everything else gets a proportional subset. The framework is heavy when it needs to be and invisible when it doesn't.
Next article: what all of this looks like stitched together on a normal workday, from project kickoff to end of day.





