1.6M Git Events Show AI Code Needs More QA

Matthew Diakonov

Analysis of large-scale git activity reveals an uncomfortable trend: AI-generated code is scaling faster than the QA processes designed to catch its mistakes. Code ships faster, but review quality has not kept up.

This is not speculation. GitClear's research across 211 million changed lines of code from repositories owned by Google, Microsoft, Meta, and enterprise organizations shows measurable degradation in code quality metrics since AI coding tools became mainstream. CodeRabbit's State of AI vs. Human Code Generation report confirms it: AI-written code produces 1.7x more issues than human-written code across production systems.

The question is no longer whether AI-generated code needs more QA. It does. The question is how to build review processes that scale with the output.

The Volume Problem: Math That Does Not Work

When a single developer can generate 10x more code with AI assistance, the review bottleneck shifts. Previously, the limiting factor was writing speed. Now it is review capacity.

Here is the concrete math. A team of five developers using AI coding tools can produce the output of 20-30 developers, but they still have five humans available for code review. If each developer generates 500 lines of code per day instead of 100, and review throughput stays flat, the review backlog grows every single day.
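
A back-of-envelope model makes the imbalance concrete. The review capacity figure below is an assumption for illustration, not a measured value:

# Minimal sketch of review backlog growth. Output numbers come from the
# scenario above; REVIEW_CAPACITY_PER_DEV is an assumed figure.
DEVS = 5
LINES_PER_DEV_PER_DAY = 500    # AI-assisted output
REVIEW_CAPACITY_PER_DEV = 400  # assumed: lines one developer can review per day

daily_output = DEVS * LINES_PER_DEV_PER_DAY    # 2,500 lines/day
daily_review = DEVS * REVIEW_CAPACITY_PER_DEV  # 2,000 lines/day
backlog_growth = daily_output - daily_review   # 500 unreviewed lines/day

print(f"Backlog grows by {backlog_growth} lines/day, {backlog_growth * 5} lines/week")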

Recent telemetry from 10,000+ developers across 1,255 teams confirms this at scale: PRs merged increased by 98%, but review times grew by 91%. Teams respond to this pressure in predictable ways - they either let reviews slip or rubber-stamp changes. Both options increase bug rates.

The numbers from developer surveys reinforce this dynamic:

  • PRs per author increased 20% year-over-year
  • Incidents per PR increased 23.5% in the same period
  • Change failure rate increased 30%
  • Cycle time increased 9% despite theoretically faster development

This is the paradox of AI-assisted development: gross velocity goes up, but net velocity - code that stays deployed and works correctly - does not improve proportionally.

What the Data Actually Shows: Five Patterns in AI-Heavy Codebases

Projects with high AI-generated code ratios show five distinct patterns when you analyze their git event streams.

Pattern 1: PR Sizes Balloon

Instead of focused 50-line changes, PRs regularly hit 200-500 lines because AI generates full implementations rather than incremental changes. This matters enormously for review quality.

Research across millions of PRs shows the relationship between size and defect detection is steep:

PR Size (lines)    Defect Detection Rate    Typical Review Time
Under 100          87%                      Under 1 hour
200-400            ~75%                     2-4 hours
400-1,000          Below 70%                4-8 hours
Over 1,000         28%                      Often rubber-stamped

Microsoft's engineering team found that PRs under 300 lines received 60% more thorough reviews. When they implemented automated warnings for PRs over 400 lines, post-merge defects dropped by 35%.

AI coding tools default to generating complete implementations. Ask Copilot or Claude to add a feature and you will get a 300-line PR, not the 40-line incremental change a human would have written. Each additional 100 lines increases review time by 25 minutes, and at 2,000+ lines, reviewers experience cognitive overload that leads to rushed approvals.

Pattern 2: Test Coverage Gaps Widen

AI generates implementation code faster than test code, and developers often skip writing tests for AI-generated code they assume is correct. This assumption is dangerous.

AI-generated code has 1.75x more logic and correctness errors than human-written code. These are exactly the kinds of bugs that tests are supposed to catch. When you combine higher error rates with lower test coverage, you get a compounding quality problem.
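
One way to close the gap is to gate on coverage of the changed lines specifically, rather than overall project coverage. A minimal sketch using the diff-cover tool, assuming a pytest/coverage.py setup:

# Measure coverage on changed lines only, not the whole project
pip install diff-cover
pytest --cov=src --cov-report=xml
diff-cover coverage.xml --compare-branch=origin/main --fail-under=80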

The specific categories where AI code underperforms are revealing:

  • Logic and correctness errors: 1.75x more frequent
  • Maintainability issues: 1.64x more frequent
  • Security vulnerabilities: 1.57x more frequent
  • Performance issues: 1.42x more frequent

Security deserves special attention. AI-generated code is 2.74x more likely to introduce XSS vulnerabilities, 1.91x more likely to create insecure object references, 1.88x more likely to mishandle passwords, and 1.82x more likely to implement insecure deserialization. These are not theoretical risks - they are measured rates from production codebases.
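
To make the XSS category concrete, here is the shape of the bug in a minimal, illustrative Python handler - user input interpolated straight into HTML versus escaped first:

from markupsafe import escape

def greeting_unsafe(name):
    # Vulnerable: a name like "<script>..." executes in the browser
    return f"<p>Hello, {name}!</p>"

def greeting_safe(name):
    # Escaped before rendering, which neutralizes injected markup
    return f"<p>Hello, {escape(name)}!</p>"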

Pattern 3: Code Churn Doubles

Code churn - the percentage of code that gets discarded less than two weeks after being written - is one of the clearest signals of quality problems. GitClear's longitudinal data tells a stark story:

  • 2020-2022: Code churn held steady at 3-4%
  • 2023: Churn jumped to 5.5% as AI coding tools gained adoption
  • 2024: Churn projected to exceed 7%, double the 2021 rate

This means more than 7% of new code is reverted or substantially rewritten within two weeks of shipping. That is wasted review time, wasted deployment effort, and wasted incident response.

The copy/paste problem compounds this. The share of copy/pasted lines surged from 8.3% in 2020 to 12.3% in 2024 - a 48% relative increase. Simultaneously, the share of "moved" (refactored) code plummeted from 24.1% to 9.5%. AI tools are producing more duplicated, less maintainable code that needs to be rewritten shortly after it ships.

Pattern 4: Rollback Frequency Climbs

More code ships, more bugs ship, and more emergency reverts happen. As 2025 progressed, more production incidents and postmortems pointed to AI-generated code as a contributing factor.

AI-generated code showed up to 75% more logic and correctness issues, concentrated in the areas most likely to contribute to downstream incidents. The operational costs - missed SLAs, reliability regressions, customer churn - began eroding the cost savings that AI-generated code was supposed to deliver.

Pattern 5: Developer Trust Erodes

61% of developers agree that "AI often produces code that looks correct but isn't reliable." Only 33% of developers say they trust AI tool output, with just 3% reporting high trust. Meanwhile, 66% report spending more time fixing "almost-right" AI-generated code than they would have spent writing it from scratch.

71% of developers do not merge AI-generated code without manual review. The trust deficit is real and justified by the data.

A Practical Framework for Scaling QA

The practical solution is not reviewing every line. It is building a layered review system that catches different categories of problems at different stages.

Layer 1: Automated Gates in CI/CD

Set up automated quality gates that run before any human sees the code. These should block merges, not just warn.

Static analysis and linting:

# .github/workflows/ai-code-quality.yml
name: AI Code Quality Gates
on: [pull_request]

jobs:
  quality-gates:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # fetch full history so origin/main exists for diffing

      - name: Run Semgrep security scan
        uses: semgrep/semgrep-action@v1
        with:
          config: >-
            p/security-audit
            p/owasp-top-ten
            p/xss

      - name: Check PR size
        run: |
          STAT=$(git diff --shortstat origin/main...HEAD)
          INSERTIONS=$(echo "$STAT" | grep -oE '[0-9]+ insertion' | grep -oE '[0-9]+' || true)
          DELETIONS=$(echo "$STAT" | grep -oE '[0-9]+ deletion' | grep -oE '[0-9]+' || true)
          LINES_CHANGED=$(( ${INSERTIONS:-0} + ${DELETIONS:-0} ))
          if [ "$LINES_CHANGED" -gt 400 ]; then
            echo "::warning::PR exceeds 400 lines. Consider splitting."
            echo "Large PRs have 28% defect detection vs 87% for small PRs."
          fi

      - name: Enforce test coverage threshold
        run: |
          # Fail if overall coverage drops below 60%
          # (assumes Python and test dependencies were installed in a prior step)
          pytest --cov=src --cov-fail-under=60

Dependency validation is critical for AI-generated code. AI tools sometimes hallucinate packages that do not exist, or reference outdated versions with known vulnerabilities. Run Snyk or Dependabot checks as required status checks, not optional ones.

# Audit dependencies before installing: a hallucinated package fails to
# resolve, and known-vulnerable versions are reported.
# (--require-hashes expects a hash-pinned requirements file)
pip install pip-audit
pip-audit -r requirements.txt --require-hashes --strict

# For JavaScript projects
npm audit --audit-level=high

Layer 2: AI-on-AI Review

Let one model review another model's output. This is not redundant - it catches a different class of issues than static analysis.

Tools like CodeRabbit, GitHub Copilot code review, or Anthropic's Claude code review can evaluate PRs for:

  • Whether the implementation matches the described intent
  • Logical errors that static analysis misses
  • Unnecessary complexity or over-engineering
  • Missing error handling and edge cases

Configure AI review as a required status check in your repository settings. This creates an enforceable quality gate:

// .github/branch-protection.json (conceptual)
{
  "required_status_checks": {
    "strict": true,
    "contexts": [
      "semgrep",
      "test-coverage",
      "ai-code-review",
      "dependency-audit"
    ]
  }
}

The key insight is that AI reviewers and human reviewers catch different things. AI reviewers are good at spotting pattern violations, security anti-patterns, and logical inconsistencies. Human reviewers are better at evaluating whether the code solves the right problem and fits the architecture.
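
As a minimal sketch of the AI side of this split, the script below sends a PR diff to a review model and prints the findings. It uses the Anthropic Python SDK; the model ID and prompt are assumptions to adapt, not a prescribed setup:

# Send the PR diff to a review model and print its findings
import subprocess
import anthropic

diff = subprocess.run(
    ["git", "diff", "origin/main...HEAD"],
    capture_output=True, text=True,
).stdout

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model ID; pin the one you use
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": "Review this diff for logic errors, missing error "
                   "handling, and security anti-patterns:\n\n" + diff,
    }],
)
print(response.content[0].text)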

Layer 3: Focused Human Review

Human reviewers should not waste time on problems that automated tools already caught. Focus human attention on three things:

1. Architecture decisions. Does this change fit the system design? Is the abstraction boundary in the right place? AI tools optimize locally - they generate correct code for the immediate task but frequently miss the broader architectural context.

2. Business logic correctness. Does the code actually solve the problem the ticket describes? AI can generate a technically correct implementation of the wrong thing. This is the "looks correct but isn't reliable" problem that 61% of developers report.

3. Security boundaries. Even with automated security scanning, human review of authentication flows, authorization checks, and data handling is irreplaceable. AI-generated code is nearly 3x more likely to introduce XSS vulnerabilities - automated scanners catch some of these, but not the context-dependent ones.
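
One cheap way to enforce the third point is a check that flags any PR touching security-sensitive paths so it cannot auto-merge without human sign-off. A small sketch, with illustrative path patterns you would adjust to your codebase:

# Flag PRs that touch security-sensitive paths; patterns are illustrative
import fnmatch
import subprocess
import sys

SENSITIVE = ["*auth*", "*login*", "*password*", "*session*", "*crypto*"]

changed = subprocess.run(
    ["git", "diff", "--name-only", "origin/main...HEAD"],
    capture_output=True, text=True,
).stdout.splitlines()

flagged = [f for f in changed
           if any(fnmatch.fnmatch(f.lower(), p) for p in SENSITIVE)]

if flagged:
    print("Security-sensitive files changed - require human security review:")
    for f in flagged:
        print(f"  {f}")
    sys.exit(1)  # fail the check until a human approves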

Layer 4: Post-Merge Monitoring

Set up automated monitoring for code that was recently merged, especially AI-generated code:

# Example: Track code churn for AI-authored commits
# Flag files that get modified within 14 days of creation

import subprocess
from datetime import datetime, timedelta

def get_recent_files(days=14):
    """Find files created in the last N days that have already been modified."""
    cutoff = (datetime.now() - timedelta(days=days)).isoformat()

    # Get files added in the last N days (deduplicated across commits)
    new_files = set(subprocess.run(
        ["git", "log", f"--since={cutoff}", "--diff-filter=A",
         "--name-only", "--pretty=format:"],
        capture_output=True, text=True
    ).stdout.split('\n'))

    # Check which ones were already modified after creation
    churned = []
    for f in sorted(new_files):
        if not f:
            continue
        commits = subprocess.run(
            ["git", "log", f"--since={cutoff}", "--oneline", "--", f],
            capture_output=True, text=True
        ).stdout.strip().splitlines()
        if len(commits) > 1:  # creation commit plus at least one modification
            churned.append((f, len(commits) - 1))

    return sorted(churned, key=lambda x: x[1], reverse=True)

# Files with highest churn indicate quality problems
for filepath, changes in get_recent_files():
    print(f"  {changes} changes: {filepath}")

Track these metrics weekly:

  • Code churn rate: What percentage of new code gets modified within 14 days?
  • Rollback frequency: How many reverts per 100 merged PRs?
  • Incident attribution: What percentage of production incidents trace to recently merged AI-generated code?
  • Review coverage: What percentage of lines in merged PRs received human review comments?
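
As an example of the second metric, rollback frequency can be approximated straight from git history. This sketch assumes a merge-commit workflow; squash-merge repositories would need PR data from the hosting API instead:

# Approximate reverts per 100 merged PRs over the last 30 days
import subprocess

def count_commits(*extra_args):
    out = subprocess.run(
        ["git", "log", "--since=30 days ago", "--oneline", *extra_args],
        capture_output=True, text=True,
    ).stdout.strip()
    return len(out.splitlines()) if out else 0

merged_prs = count_commits("--merges")     # merge commits ~ merged PRs
reverts = count_commits("--grep=^Revert")  # git's default revert message prefix

if merged_prs:
    rate = 100 * reverts / merged_prs
    print(f"Rollback frequency: {rate:.1f} reverts per 100 merged PRs")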

Enforcing PR Size Limits: The Highest-Leverage Change

If you only do one thing from this article, enforce PR size limits. The data is overwhelming: small PRs get better reviews, catch more bugs, and merge faster.

Here is a practical approach:

#!/bin/bash
# .git/hooks/pre-push or CI check
# Warn at 200 lines, block at 500 lines
# Note: grep -P needs GNU grep; on macOS install it as ggrep

MAX_WARN=200
MAX_BLOCK=500

STAT=$(git diff --shortstat origin/main...HEAD)
INSERTIONS=$(echo "$STAT" | grep -oP '\d+(?= insertion)' || echo 0)
DELETIONS=$(echo "$STAT" | grep -oP '\d+(?= deletion)' || echo 0)
LINES=$((INSERTIONS + DELETIONS))

if [ "$LINES" -gt "$MAX_BLOCK" ]; then
    echo "ERROR: PR has $LINES changed lines (max: $MAX_BLOCK)."
    echo "Split this into smaller PRs for effective review."
    echo ""
    echo "Tip: Use 'git add -p' to stage partial changes."
    exit 1
elif [ "$LINES" -gt "$MAX_WARN" ]; then
    echo "WARNING: PR has $LINES changed lines."
    echo "PRs over 200 lines have significantly lower defect detection rates."
    echo "Consider splitting if possible."
fi

This forces AI-assisted workflows to break work into reviewable chunks rather than generating monolithic implementations.

The Refactoring Deficit

One of the most concerning trends in the GitClear data is the collapse of refactoring. The share of "moved" code - a proxy for refactoring activity - dropped from 24.1% in 2020 to 9.5% in 2024. Meanwhile, newly added code grew from 39% to 46%.

AI tools are biased toward generating new code rather than improving existing code. When a developer asks for a feature, the AI writes a new implementation rather than refactoring the existing codebase to accommodate it. Over time, this creates a codebase full of duplicated patterns, inconsistent abstractions, and mounting technical debt.

By 2026, 75% of technology decision-makers are projected to face moderate to severe technical debt from AI-speed practices. The fix is not to stop using AI tools - it is to explicitly budget time for refactoring and to train AI tools on your specific codebase patterns so they extend existing abstractions rather than creating new ones.

Practical steps:

  1. Track the refactoring ratio. Measure what percentage of your git activity is moves/renames versus additions (a rough sketch follows this list). If it is below 15%, you are accumulating debt.
  2. Schedule refactoring sprints. Dedicate 20% of sprint capacity to paying down AI-generated technical debt.
  3. Use AI for refactoring, not just generation. Tools like Claude and Copilot can refactor existing code well when given explicit instructions to do so. The problem is that developers default to "add this feature" prompts rather than "refactor this module to support this feature" prompts.
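
For step 1, a coarse file-level proxy can be computed from git's rename detection. GitClear measures moved lines; this sketch only sees whole-file renames and moves, so treat it as a trend indicator rather than a directly comparable percentage:

# File-level proxy for the refactoring ratio: renames/moves (R) vs additions (A)
import subprocess
from collections import Counter

log = subprocess.run(
    ["git", "log", "--since=90 days ago", "-M", "--name-status",
     "--pretty=format:"],
    capture_output=True, text=True,
).stdout

# First letter of each status field: A(dded), M(odified), D(eleted), R(enamed)
statuses = Counter(line.split("\t")[0][0]
                   for line in log.splitlines() if line.strip())

total = sum(statuses.values())
if total:
    print(f"Renames/moves: {100 * statuses['R'] / total:.1f}% of file changes")
    print(f"Additions:     {100 * statuses['A'] / total:.1f}% of file changes")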

What This Means for Teams in 2026

The era of AI-speed development without AI-scale QA is ending. Teams that built their workflows around maximizing code generation volume in 2024-2025 are now dealing with the consequences: higher incident rates, growing technical debt, and developer frustration with "almost-right" code.

The teams that will win in 2026 are the ones that treat QA as a first-class concern in their AI-assisted workflows - not an afterthought. That means:

  • Automated quality gates that block bad code before humans see it
  • AI-on-AI review as a required check, not an optional nice-to-have
  • Human review focused on architecture, business logic, and security
  • Post-merge monitoring with churn and rollback tracking
  • Enforced PR size limits to keep reviews effective
  • Explicit refactoring budgets to counter AI's bias toward new code

The tools exist. The data is clear. The only question is whether your team will adapt before the technical debt catches up.

Fazm is an open source macOS AI agent, available on GitHub.
