After a particularly long session, Claude generated 47 beautiful tests for my shopping list feature. Green checkmarks everywhere.
Turns out Claude had spent an hour making its tests pass by quietly rewriting my service layer to match what the tests expected.
Why AI Tests Go Wrong (Every Single Time)
After watching Claude generate thousands of tests across multiple projects, I’ve seen the same patterns:
The Mock Circus: Claude mocks everything. Database? Mocked. API client? Mocked. The mock itself? Somehow also mocked. I once found a test that mocked the shopping service, then verified the mock was called. The actual service was never touched, but the test passed beautifully.
The Backwards Refactor: This one hurts the most. Claude writes a test, it fails, so Claude “fixes” the production code to make the test pass. “I’ll just update the service to return what the test expects.” No, Claude. That’s backwards.
The 99-Test Special: Ask for tests, get 99. Every edge case, every permutation, multiple tests for the same logic with slightly different names. Your PR reviewer’s eyes glaze over by test #15.
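To make the Mock Circus concrete, here is a minimal sketch that uses only unittest.mock from the standard library. ShoppingService and InMemoryItemRepo are hypothetical stand-ins, not the real code from this project. The first test can never fail because it only exercises the mock. The second runs real logic against a small in-memory fake.

```python
from unittest.mock import MagicMock


class InMemoryItemRepo:
    """Hypothetical fake repository: real behavior, no database."""

    def __init__(self):
        self._items = {}

    def add(self, item_id, status="pending"):
        self._items[item_id] = {"id": item_id, "status": status}

    def get(self, item_id):
        return self._items[item_id]


class ShoppingService:
    """Hypothetical stand-in for the service under test."""

    def __init__(self, repo):
        self.repo = repo

    def check_off_item(self, item_id):
        item = self.repo.get(item_id)
        item["status"] = "completed"
        return item


def test_mock_circus():
    # Anti-pattern: the service itself is a mock, so no production
    # code runs. This passes even if ShoppingService is deleted.
    service = MagicMock()
    service.check_off_item(1)
    service.check_off_item.assert_called_once_with(1)


def test_check_off_marks_item_completed():
    # Better: real service logic, with only the storage faked.
    repo = InMemoryItemRepo()
    repo.add(1)
    result = ShoppingService(repo).check_off_item(1)
    assert result["status"] == "completed"
```

A quick smell test: if you can break the production code and the test stays green, the test is checking the mock.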
My Strategy: Making AI Write Tests Like a Senior Dev
Here’s what changed everything: I stopped asking Claude to write tests. Instead, I ask it to write test titles first.
The Title-First Approach
"Give me test titles only for the shopping list check-off feature. 
Focus on business logic and behavior. Don't test implementation. Think critically like a principal engineer. Remember tests have to be maintained and tests represent business requirements. Use pytest parametrize where possible to consolidate multiple tests."
Claude gave me 8 titles. Just 8 one-line descriptions:
- test_check_off_item_updates_status (parametrized)
- test_check_off_item_smart_categorization (parametrized)
- test_check_off_nonexistent_item_handling
- test_bulk_check_off_operations (parametrized)
- test_check_off_item_completion_tracking
- test_undo_check_off_functionality
- test_check_off_with_quantity_partial_completion
- test_integration_check_off_updates_shopping_list
Now I could actually think. Do I need undo functionality? No, we don’t have that. Bulk operations? Not yet. Integration test? Actually useful.
I picked 3 that mattered and told Claude to implement just those.
Why This Works
The magic isn’t in the titles—it’s in the pause. When Claude generates 47 tests at once, my brain shuts off. I skim, I nod, I merge. But when I see 8 titles, I can actually evaluate each one against my requirements.
This isn’t a silver bullet. It takes more upfront thinking than just yelling “write tests!” at Claude. But those 10 minutes of strategic thinking save hours of deleting redundant tests later.
Real Tests from Real Code
Here’s what we actually implemented. Notice the messy, real-world data:
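import pytest

# create_shopping_item and shopping_service come from the project's own test helpers and service layer.
@pytest.mark.parametrize("item_status,category,expected_status,expected_category", [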
    ("pending", "produce", "completed", "produce"),
    ("pending", "dairy", "completed", "dairy"),
    ("completed", "meat", "completed", "meat"),  # idempotent - re-checking shouldn't break
    ("pending", None, "completed", "general"),   # uncategorized items get default
])
def test_check_off_item_updates_status(item_status, category, expected_status, expected_category):
    # Arrange
    item = create_shopping_item(status=item_status, category=category)
    
    # Act
    result = shopping_service.check_off_item(item.id)
    
    # Assert
    assert result.status == expected_status
    assert result.category == expected_category
The idempotent case? That came from a real bug where double-tapping items crashed the app.
The Categorization Test That Actually Caught Bugs
@pytest.mark.parametrize("item_name,expected_category,confidence_score", [
    ("Avocado (Haas) - 2ct", "produce", 0.85),
    ("StoreBrand_2%_Milk_1gal", "dairy", 0.75),  # Ugly but real format
    ("ground beef 80/20 family pack", "meat", 0.90),
    ("????", "general", 0.1),  # User typed emoji by accident
    ("", "general", 0.0),  # Empty string from voice input bug
])
def test_check_off_item_smart_categorization(item_name, expected_category, confidence_score):
    # Yes, we actually get items formatted like "StoreBrand_2%_Milk_1gal"
    # from our grocery API integration. Welcome to the real world.
    
    item = create_shopping_item(name=item_name, category=None)
    result = shopping_service.check_off_item(item.id)
    
    assert result.category == expected_category
    assert result.categorization_confidence >= confidence_score
The Process That Actually Works
After months of trial and error, here’s what I do now:
- Get titles first - Just the test names, nothing else
- Pick 2-3 that matter - The critical paths your users actually hit
- Generate them in small batches - Early on, I’d let Claude generate 10 tests at once, and my eyes would glaze over during review. Now I cap it at 2-3 per batch, and I actually read them.
- Watch for the backwards refactor - The moment Claude says “I’ll update the service to match the test,” hit ESC immediately
- Check for mock madness - If the test has more than 2 mocks, it’s probably testing the mocks, not your code
The Hard Lesson I Learned
I used to think more tests = better. Then we had an incident where a critical feature broke in production. We had 7 tests for that feature. Not one caught the actual bug.
Why? They were all testing Claude’s elaborate mock setup. The real integration point—a simple logger failure—had zero coverage.
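I can't reproduce the failing code here, so treat this as a sketch of the kind of test that would have covered that gap. It assumes the requirement is that a broken logger must never stop a check-off. ExplodingLogger and this ShoppingService are hypothetical stand-ins.

```python
class ExplodingLogger:
    """Hypothetical logger whose backend is down: every call raises."""

    def info(self, message):
        raise ConnectionError("log backend unavailable")


class ShoppingService:
    """Hypothetical stand-in for the real service."""

    def __init__(self, logger):
        self.logger = logger

    def check_off_item(self, item):
        item["status"] = "completed"
        try:
            self.logger.info(f"checked off {item['name']}")
        except Exception:
            # Assumed requirement: logging is best-effort and must
            # never block the user's action.
            pass
        return item


def test_check_off_survives_logger_failure():
    # No mocks: a real (if tiny) collaborator that really fails.
    service = ShoppingService(logger=ExplodingLogger())
    result = service.check_off_item({"name": "milk", "status": "pending"})
    assert result["status"] == "completed"
```

The point is that the failing dependency is real behavior the service has to survive, not a mock that always cooperates.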
An Unexpected Benefit: Dead Code Detection
Here’s something I discovered by accident: After Claude writes code, run your test suite with a coverage report. Any function with 0% coverage deserves a second look. If nothing in the codebase calls it, Claude probably invented it “just in case.” Delete it.
Claude loves writing helper functions that nothing actually uses. _validate_item_category_with_fallback? Never called. _normalize_shopping_list_data? Orphaned. Coverage reports are like a metal detector for AI bloat.
Is This Worth the Effort?
Let me be honest: This approach takes more work upfront. You can’t just fire-and-forget. You have to think about what you’re testing and why.
But here’s what I’ve seen after 3 months:
- CI runs 40% faster (fewer redundant tests)
- PR reviews actually happen (reviewers aren’t overwhelmed by 47 tests)
- When tests fail, they point to real problems, not mock configuration issues
- New team members can understand our test suite (because it’s not 500 tests for a CRUD feature)
What I’m Still Figuring Out
I don’t have all the answers. Should we let AI write tests at all, or should tests be the place where human input is most required because that’s where intent is defined?
But I do know this: Constraining AI test generation from 99 down to 5 focused tests has made our codebase more maintainable, not less tested.
How are you handling AI-generated tests in your projects? Have you found the sweet spot between coverage and maintainability? I’d love to hear what’s working for you—drop me an email at hello@ashishacharya.com.