Last week I had four Claude sessions open, each working on a different feature for my app. Session 1 was refactoring the shopping list. Session 2 was adding pantry management. By the time I looked back at Session 1, it had helpfully “fixed” all of Session 2’s changes.
I tried cloning the repo multiple times and ended up with folders named kinmel1, kinmel2, kinmel-final, and kinmel-final-FINAL. It worked, but I spent more time cleaning up duplicate repos than coding.
Then I discovered git worktrees could give each AI its own isolated workspace. No conflicts. No cleanup. Just parallel development that actually works.
The First Problem: AI Amnesia
Every new Claude session is like hiring a brilliant developer with complete amnesia. You explain your Django patterns, your specific way of organizing services, why you chose Django Ninja over DRF. Claude gets it perfectly.
Tomorrow’s session? “Have we met? Oh, Django? Let me refactor this to FastAPI, it’s much cleaner.”
Even within a session, things drift. I once watched Claude slowly migrate my error handling from exceptions to result types over the course of an hour. Elegant? Sure. What I asked for? Absolutely not.
CLAUDE.md: Your AI’s Operating Manual
I started with a simple CLAUDE.md file after Claude added docstrings to every single method for the fifth time. “No comments unless asked” became rule #1.
## 📋 User's Preferred Workflow
1. Read existing code before writing new
2. Follow existing patterns exactly
3. Ask before architectural changes
4. Keep changes minimal - don't refactor unless asked
5. No comments in code unless explicitly asked
## 📚 Required Reading
Go read these files first:
- API_REFERENCE.md - All endpoints and schemas
- ARCHITECTURE.md - Service boundaries
- DATA_MODEL.md - Database schema
- FEATURES.md - Business logic
## 🗂️ Key Locations
- Backend: backend/apps/*/api.py, */models.py, */services.py
- Frontend: frontend/src/components/, src/services/api/
My first version was 2000 lines. Claude ignored most of it. Now I keep it under 50 lines—just enough to prevent the most annoying behaviors.
I’m still not sure if organizing by file paths actually helps or if Claude just pretends to follow them. But the “no comments” rule? That one sticks.
The Second Problem: AI Can’t Run Code (Well)
Claude would write beautiful code that didn’t run. Missing imports. Wrong types. Functions that didn’t exist. I’d spend 20 minutes fixing what took Claude 2 minutes to write.
Then I added validation tools. Now when Claude writes code, ruff immediately screams about the imports. Mypy catches the type errors. The pre-commit hooks catch everything else.
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9  # pin to whichever release you use
    hooks: [{ id: ruff, args: [--fix] }, { id: ruff-format }]
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.11.2
    hooks: [{ id: mypy }]
Claude still writes broken code. But now it fixes it immediately instead of me discovering it later.
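To make that loop concrete, here's a hypothetical sketch (not code from my repo) of the kind of slip the hooks catch: ruff flags the dead import and mypy rejects the return type before the commit ever lands.

```python
# Hypothetical example of a typical AI slip -- not from the actual codebase.
import json  # ruff F401: imported but never used


def summarize_items(names: list[str]) -> int:
    # mypy: Incompatible return value type (got "str", expected "int")
    return ", ".join(names)
```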
The Testing Nightmare
Yesterday Claude wrote 47 tests for a simple shopping list feature. It spent an hour getting them all to pass. The actual feature? Broken. The tests? They were testing Claude’s mocks, not my code.
The worst part: Claude kept modifying my production code to make the tests pass. “I’ll just update the service to match what the test expects.” No. That’s backwards.
What Actually Works
I tell Claude to write exactly 5 tests. No more. The critical paths only. When it starts writing the sixth test, I hit ESC.
"Write tests for OrderService following tests/test_shopping_service.py"
For the frontend, I sketch rough UIs and have Claude build with shadcn/ui components. At least when those break, they break consistently.
The Mock Hell Problem
Claude loves mocks. It’ll mock the database, mock the API, mock the mock of the mock. Then write elaborate tests proving the mocks work.
I caught Claude once writing a test that mocked the shopping service, then tested that the mock was called correctly. The actual shopping service? Never touched. The test passed beautifully.
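Here's roughly what that looked like, reconstructed with illustrative names (the real service isn't shown here): the first test only proves the mock works; the second is the version that actually touches my code.

```python
from unittest.mock import MagicMock

from apps.shopping.services import ShoppingService  # hypothetical import path


def test_add_item_mock_only():
    # The anti-pattern: mock the service, then assert the mock was called.
    service = MagicMock()
    service.add_item("milk")
    service.add_item.assert_called_once_with("milk")  # passes even if the real code is broken


def test_add_item_real_service():
    # What I actually want: exercise the real object and check observable state.
    service = ShoppingService()
    service.add_item("milk")
    assert "milk" in service.item_names()  # hypothetical accessor
```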
Running 4 AIs Without Them Destroying Each Other
Here’s what actually happened when I first tried parallel AI sessions:
I had four Claude windows open. Shopping list refactor in window 1. Pantry feature in window 2. API updates in window 3. Frontend in window 4. Came back from lunch—they’d all been “helpfully” fixing each other’s code. Complete chaos.
My Failed Attempts
Attempt 1: Just be careful about which files each session touches. Failed immediately. Claude loves to “improve” nearby code.
Attempt 2: Clone the repo multiple times. Worked, but I had folders everywhere: kinmel1, kinmel2, kinmel-final, kinmel-final-FINAL, kinmel-actually-final. Spent more time managing repos than coding.
Attempt 3: Git worktrees.
Git Worktrees: The Actual Solution
Each Claude session gets its own isolated workspace:
git worktree add ../kinmel-models feature/models
git worktree add ../kinmel-service feature/service
git worktree add ../kinmel-api feature/api
git worktree add ../kinmel-frontend feature/frontend
Now I can run 4 Claude sessions simultaneously. Each one only sees its own branch. No conflicts. No stepping on each other.
Window 1 works on models. Window 2 builds services. They literally can’t see each other’s changes until I merge the PRs.
The Reality Check
I don’t actually throw away 1 in 5 PRs like I originally claimed. Looking at my Kinmel repo, every single PR got merged. But plenty needed fixes:
- PR #7: “Fix biome issues” - Claude broke the linter
- PR #10: Another biome fix - Claude broke it again
- PR #3: Had to add an endpoint Claude forgot
The truth is messier: I merge everything, then spend time fixing what’s broken. Is this faster than doing it myself? On balance, yes. But it’s not the clean “throw away the bad PRs” story I wanted to tell.
ESC Key Saves Lives
My most-used key combination: ESC. Claude starts refactoring something unrelated? ESC. Writing its 30th test? ESC. Adding helpful comments everywhere? ESC.
I probably hit ESC 20+ times per session. It’s like having a brake pedal for runaway AI.
Does This Actually Work?
I built my Kinmel app (Django + Next.js SaaS) using this system. 15 PRs, all merged. Did it quadruple my velocity? Honestly, I don’t know—I’ve never built Kinmel without AI to compare.
What I do know:
- I ship features faster than when I code alone
- I spend less time on boilerplate
- I fix more AI mistakes than I’d like
- The code follows my patterns (mostly)
The 3-4 features per day claim? That depends on what you call a “feature.” Adding a pantry module with full CRUD? Yes. Building a complete authentication system? No.
What I’m Still Figuring Out
Documentation balance: How much is too much? My CLAUDE.md keeps growing, but Claude seems to ignore anything past line 50.
Test strategy: Should I let Claude write any tests? Half the time it’s testing its own mocks. But manual testing everything isn’t sustainable either.
PR review overhead: Reviewing 4 AI-generated PRs takes serious time. Sometimes I wonder if I’m just moving work from coding to reviewing.
Technical debt: Am I accumulating debt faster than I realize? Ask me in six months when I’m maintaining this code.
The Uncomfortable Truth
This system works for me because I can recognize when Claude goes off the rails. If you’re new to Django or Next.js, parallel AI sessions might create more problems than they solve. You need to know enough to spot the nonsense quickly.
But if you do have that experience? Running parallel AI sessions with git worktrees genuinely multiplies your output. Not 4x. But definitely more than 1x. And sometimes that’s enough.