129K Commits Later - Vibe Coding Is Just Coding

Matthew Diakonov

When Andrej Karpathy coined "vibe coding" in February 2025, he described it as a mode where you "fully give in to the vibes, embrace exponentials, and forget that the code even exists." He was talking about weekend throwaway projects - accepting AI output without reading diffs, copying error messages back to the model, and shipping whatever worked.

A year later, Karpathy himself said vibe coding is passé. He now calls the professional version "agentic engineering" - programming via LLM agents with oversight and scrutiny. But here is the thing: after 129,000 commits where agents wrote the majority of the code, I think even that distinction misses the point. What we are doing is just coding. The tool changed. The craft did not.

This post breaks down what actually happens when you scale AI-assisted development to six figures of commits - what works, what breaks, and why the debate over naming conventions matters less than the engineering discipline underneath.

The Numbers That Changed My Mind

Let me start with the industry context, because it matters for understanding why this shift is irreversible.

GitHub reported that over 51% of all code committed to its platform in early 2026 was either generated or substantially assisted by AI. That is not a niche workflow anymore - it is the majority of how code gets written. Eighty-four percent of developers now use AI tools in some capacity, and the average developer saves roughly 3.6 hours per week with AI assistance.

But the raw numbers hide the real story. GitHub Copilot's suggestion acceptance rate hovers around 30-31% - meaning developers actively reject roughly 70% of what AI proposes. Among startups the acceptance rate climbs to 75%, while banking and finance stays at 65%. The gap tells you something important: acceptance correlates with risk tolerance, not with AI capability.

In our codebase, the acceptance pattern evolved over time. Early on, we accepted maybe 40% of AI suggestions. After establishing strong linting rules, test coverage requirements, and a CLAUDE.md file specifying coding standards, that climbed to roughly 70%. The improvement was not because the AI got smarter between versions - it was because we gave it better constraints.

What 129K Commits Actually Looks Like

At 129,000 commits, you stop thinking about individual contributions and start thinking about systems. Here is the breakdown of what a typical week looks like in a codebase at this scale:

The Commit Distribution

Roughly 80% of commits are agent-authored with human review. Another 10% are human-authored with AI assist (autocomplete, refactoring suggestions). The remaining 10% are purely human - usually configuration, documentation, or architectural decisions that require judgment the model cannot provide.

The agent-authored commits follow a predictable pattern:

# Typical agent commit flow
1. Human writes spec or describes feature in natural language
2. Agent generates implementation (often 100-300 lines)
3. Automated tests run (agent wrote most of those too)
4. Human reviews diff - accepts, requests changes, or rejects
5. Agent addresses feedback in follow-up commit
6. Merge

This looks exactly like a senior engineer directing a junior developer. The junior developer happens to type at 10,000 words per minute and never gets tired, but the review dynamic is identical.

The Consistency Advantage

One pattern that only becomes visible at scale: AI-written code is more consistent than human-written code across a large codebase. When you establish patterns in your style guide and rules file, the agent follows them without drift. Human developers, no matter how disciplined, introduce style variation over time - different variable naming preferences, different error handling approaches, different ways to structure the same abstraction.

After 129K commits, our codebase reads like it was written by one person. That has real maintenance value. New contributors (human or AI) can read any file and immediately understand the patterns because they are the same everywhere.

The Test Suite Becomes the Specification

This is the single most important lesson. When agents write your code, your test suite is the actual source of truth for what the system should do. Not the README. Not the Jira tickets. Not the comments. The tests.

Here is why: an AI agent writing code will satisfy whatever constraints you give it. If you describe a feature in prose, it will interpret that prose and generate something plausible. If you give it a failing test, it will generate code that makes the test pass. The second approach is vastly more reliable because tests are unambiguous.

We shifted to a test-first workflow not out of TDD ideology, but out of practical necessity:

// Step 1: Human writes the test (the spec)
describe('processPayment', () => {
  it('should retry failed charges up to 3 times with exponential backoff', async () => {
    const mockGateway = createMockGateway({ failCount: 2 });
    const result = await processPayment(mockGateway, { amount: 1000 });

    expect(result.success).toBe(true);
    expect(mockGateway.attempts).toBe(3);
    expect(mockGateway.delays).toEqual([1000, 2000]);
  });

  it('should fail permanently after 3 retries', async () => {
    const mockGateway = createMockGateway({ failCount: 4 });
    const result = await processPayment(mockGateway, { amount: 1000 });

    expect(result.success).toBe(false);
    expect(result.error).toBe('MAX_RETRIES_EXCEEDED');
  });
});

// Step 2: Agent writes the implementation to pass these tests
// Step 3: Human reviews the implementation for edge cases
//         the tests did not cover
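
For concreteness, here is the kind of implementation step 2 typically yields - a minimal sketch in which the Gateway interface and Result shape are illustrative assumptions, not our production types:

// Sketch only - Gateway and Result are assumed shapes for illustration
interface Gateway {
  charge(payment: { amount: number }): Promise<void>; // rejects on failure
}

type Result = { success: true } | { success: false; error: string };

const MAX_RETRIES = 3;
const BASE_DELAY_MS = 1000;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

export async function processPayment(
  gateway: Gateway,
  payment: { amount: number }
): Promise<Result> {
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    try {
      await gateway.charge(payment);
      return { success: true };
    } catch {
      if (attempt === MAX_RETRIES) break;
      await sleep(BASE_DELAY_MS * 2 ** attempt); // exponential backoff: 1s, 2s, 4s
    }
  }
  return { success: false, error: 'MAX_RETRIES_EXCEEDED' };
}

Step 3 then focuses on what the tests did not pin down - for instance, whether a non-retryable failure like a declined card should skip the retries entirely.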

The human writes the spec (tests). The agent writes the implementation. The human reviews for gaps. This loop works at any scale.

The Review Bottleneck Is Real - and It Is the Hard Problem

Here is where the "vibe coding is just coding" thesis gets tested. The industry data on code review is sobering.

Faros AI analyzed data from more than 10,000 developers and found a 98% increase in PR volume alongside a 91% increase in PR review time. AI makes code generation faster, but the review workload grows almost proportionally because there is simply more code to look at.

The numbers get worse when you look at how long each review takes. Senior engineers spend an average of 4.3 minutes reviewing AI-generated suggestions compared to 1.2 minutes for human-written code. Why? Because AI-generated code tends to be more verbose. One study found that a manual REST API endpoint implementation of 29 lines ballooned to 186 lines when AI-generated - a 6.4x increase. An error handling refactor went from 16 lines to 288 lines - a 1,700% increase.

More code is not inherently bad. But more code requires more review time, and review time is the scarcest resource in software engineering.

How We Solved the Review Problem

After hitting this wall around the 50K commit mark, we developed a three-layer review system:

Layer 1 - Automated gates (catches 60% of issues)

# .github/workflows/ai-code-review.yml (sketch - the npm script names are placeholders)
# These gates run on every PR before human eyes see it
name: ai-code-review
on: pull_request
jobs:
  gates:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run lint                  # Style consistency
      - run: npm run type-check            # TypeScript strict mode
      - run: npm run test:unit             # Functional correctness
      - run: npm run test:integration      # System-level behavior
      - run: npm run check:bundle-size     # Performance regression
      - run: npm audit --audit-level=high  # Dependency vulnerabilities

If any of these fail, the PR goes back to the agent with the error output. The agent fixes and resubmits. Most PRs cycle through this loop 1-3 times before a human ever looks at them.
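
A minimal sketch of that cycle, with runGates and askAgentToFix standing in for your CI runner and agent integration (both names are placeholders, not real APIs):

// Layer 1 feedback loop - placeholder functions, not a real API
type GateResult = { passed: boolean; errorOutput: string };

export async function gateLoop(
  branch: string,
  runGates: (branch: string) => Promise<GateResult>,
  askAgentToFix: (branch: string, errors: string) => Promise<void>,
  maxCycles = 3
): Promise<boolean> {
  for (let cycle = 0; cycle < maxCycles; cycle++) {
    const result = await runGates(branch);
    if (result.passed) return true;                  // ready for human review
    await askAgentToFix(branch, result.errorOutput); // feed raw error output back
  }
  return false;                                      // escalate to a human
}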

Layer 2 - AI-assisted review (catches 25% of issues)

We use a separate AI model to review the code generated by the first one. This is Addy Osmani's "dual model code review" pattern - spawn a second session specifically to critique the output of the first. It catches things like unnecessary abstractions, missing edge cases, and overly clever solutions that a human reviewer would flag.
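
A minimal sketch of that pattern - the Complete function type, the prompt, and the "LGTM" convention are placeholders for whatever LLM client and conventions you use:

// Dual-model review sketch - Complete is a placeholder for your LLM client
type Complete = (prompt: string) => Promise<string>;

const REVIEW_PROMPT = `You are a strict code reviewer. For the diff below, flag
unnecessary abstractions, missing edge cases, and overly clever solutions.
Respond with a list of issues, or "LGTM" if there are none.`;

export async function dualModelReview(
  diff: string,
  reviewer: Complete // a different model/session than the one that wrote the code
): Promise<{ approved: boolean; issues: string }> {
  const issues = await reviewer(`${REVIEW_PROMPT}\n\n${diff}`);
  return { approved: issues.trim() === 'LGTM', issues };
}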

Layer 3 - Human review (catches the remaining 15%)

By the time code reaches a human reviewer, the obvious issues are gone. The human focuses on:

  • Does this architectural decision make sense long-term?
  • Are there edge cases the tests did not cover?
  • Does this introduce coupling that will be painful later?
  • Is this the right abstraction, or just a working one?

These are judgment calls that no automated system handles well. They are also the highest-value engineering work.

The Quality Question - Honest Data

Let us address the elephant in the room: does AI-generated code have more bugs?

The data is mixed but trending in a clear direction. A CodeRabbit analysis found that AI-generated code produces 1.7x more issues than human-written code. Logic errors were up 75%. Security vulnerabilities were 1.5-2x more common. Performance inefficiencies appeared 8x more frequently.

But - and this is critical - those numbers are for unreviewed AI output. When you add proper review layers, the picture changes dramatically.

GitHub's own research found that developers with Copilot access had a 53.2% greater likelihood of passing all unit tests in their study. Code written with Copilot was 5% more likely to be approved on review. In blind reviews, Copilot-assisted code had significantly fewer readability errors.

The pattern is clear: AI-generated code that ships without review is worse than human code. AI-generated code that goes through proper review is comparable or better. The variable is not the AI - it is the process around it.

Here is our defect rate data across three phases:

Phase      Commits   Defect Rate (bugs per 1K lines)   Review Process
0-30K      30,000    4.2                               Manual review only
30K-80K    50,000    2.8                               Automated gates + manual
80K-129K   49,000    1.9                               Three-layer system

The defect rate dropped by 55%, not because the AI improved (it did, but not by that much), but because the review process matured.

Why "Vibe Coding" Was Never the Right Frame

Karpathy's original tweet described vibe coding as something "not too bad for throwaway weekend projects." Simon Willison made an important distinction: not all AI-assisted programming is vibe coding. Vibe coding specifically means accepting code without understanding it.

The problem is that the term escaped its original context. People started using "vibe coding" to describe any AI-assisted development, which created a false binary: either you are a "real programmer" writing every line by hand, or you are a "vibe coder" blindly accepting AI output.

Neither description fits what professional AI-assisted development actually looks like. Here is a more accurate taxonomy:

Vibe coding (Karpathy's original definition): Accept AI output without review. Copy errors back to the model. Ship whatever works. Appropriate for prototypes and throwaway projects.

AI-assisted coding: Use AI for autocomplete, boilerplate, and suggestions while the human writes the core logic and reviews everything. This is what most of the 84% of developers using AI tools actually do.

Agentic engineering: AI agents handle entire features end-to-end while humans set direction, write specs, and review output. This is what 129K commits of experience points toward as the mature workflow.

What all three have in common: a human is making engineering decisions. The level of AI involvement varies, but the intellectual work - understanding requirements, designing systems, evaluating tradeoffs, catching edge cases - remains human.

That intellectual work is coding. It always was.

Practical Guide - How to Scale AI-Assisted Development

If you want to move from casual AI assistance to the kind of agentic engineering that produces 129K commits, here is the progression:

Stage 1: Foundation (commits 0-1K)

Set up your rules file. Whether you use Cursor's .cursorrules, Claude's CLAUDE.md, or another format, this is the highest-leverage thing you can do. Specify your coding standards, preferred patterns, forbidden patterns, and architectural constraints. This alone cuts the rejection rate of AI suggestions in half.

# Example CLAUDE.md excerpt
## Code Style
- Use TypeScript strict mode, no `any` types
- Error handling: use Result types, not try-catch for business logic
- Functions over 30 lines should be broken up
- No abbreviations in variable names

## Architecture
- All API routes go through middleware validation
- Database queries only in repository files, never in route handlers
- Business logic in service files, not in components

Establish test infrastructure early. You need fast, reliable tests before you can trust agent output. If your test suite takes 20 minutes to run, the agent feedback loop is too slow.
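
One practical way to keep that loop fast, assuming a Jest-style runner like the test example earlier: split fast unit tests from slower integration tests and let the agent run only the fast project between edits. The paths here are illustrative:

// jest.config.ts - sketch assuming a Jest setup; paths are illustrative
import type { Config } from 'jest';

const config: Config = {
  projects: [
    {
      displayName: 'unit', // fast, in-memory - the agent loop runs these
      testMatch: ['<rootDir>/src/**/*.test.ts'],
    },
    {
      displayName: 'integration', // slower - runs in the automated gates
      testMatch: ['<rootDir>/tests/integration/**/*.test.ts'],
    },
  ],
};

export default config;

The agent runs jest --selectProjects unit between changes; the full suite still runs in the CI gates before review.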

Commit frequently. Treat every successful change as a save point. When the agent goes off track (it will), you need clean rollback points.

Stage 2: Acceleration (commits 1K-10K)

Move to test-first development. Write the tests that define behavior, then let the agent implement. This forces you to think clearly about requirements before any code gets generated.

Add automated quality gates. Every PR should pass linting, type checking, tests, and security scanning before human review. This reduces review fatigue dramatically.

Track metrics. Start measuring: defect rate, review time per PR, agent rework cycles, and test coverage. You cannot improve what you do not measure.
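
A sketch of how those numbers can be computed from per-PR records - the PullRequest shape is illustrative, not a real API:

// Metrics sketch - PullRequest fields are illustrative
interface PullRequest {
  linesChanged: number;
  bugsLinkedAfterMerge: number; // defects traced back to this PR
  reviewMinutes: number;
  agentReworkCycles: number;    // times the agent had to resubmit before review
}

export function reviewMetrics(prs: PullRequest[]) {
  const lines = prs.reduce((sum, pr) => sum + pr.linesChanged, 0);
  const bugs = prs.reduce((sum, pr) => sum + pr.bugsLinkedAfterMerge, 0);
  return {
    defectRatePer1kLines: (bugs / lines) * 1000,
    avgReviewMinutes: prs.reduce((sum, pr) => sum + pr.reviewMinutes, 0) / prs.length,
    avgReworkCycles: prs.reduce((sum, pr) => sum + pr.agentReworkCycles, 0) / prs.length,
  };
}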

Stage 3: Scale (commits 10K+)

Implement multi-layer review. Automated gates, AI-assisted review, and human review as described above. This is not optional at scale - without it, review becomes the bottleneck that slows everything down.

Parallelize agent work. A single developer can direct multiple agents working on different features simultaneously. But each agent session needs its own branch, its own context, and its own review cycle. Do not let agents step on each other.

Invest in specifications. At scale, the quality of your specs determines the quality of your output. A vague prompt produces vague code. A detailed spec with test cases, edge cases, and architectural constraints produces code that is right the first time.

The Uncomfortable Truth

Here is what nobody in the "vibe coding" debate wants to say: the distinction between "writing code" and "directing an AI that writes code" is going to look quaint in a few years. We do not distinguish between "people who use IDEs" and "real programmers who use ed." We do not distinguish between "people who use high-level languages" and "real programmers who write assembly."

Every generation of developer tooling shifts the abstraction layer upward. The work gets more conceptual and less mechanical. But it does not get easier - it gets different. Reviewing a 300-line AI-generated diff for subtle logic errors is genuinely hard work. Writing specifications precise enough that an agent produces correct code on the first try is genuinely hard work. Architecting systems that remain maintainable when 80% of the code is machine-generated is genuinely hard work.

After 129,000 commits, the thing I am most certain of is this: if you are reviewing diffs, running tests, catching edge cases, and making architectural decisions, you are coding. The fact that an agent typed the characters does not change the intellectual work required.

Stop calling it vibe coding. It is just coding.

Fazm is an open-source macOS AI agent that automates desktop workflows; the source is on GitHub.
