METR published a randomized controlled trial in July 2025. Sixteen experienced open-source developers — averaging five years on repositories with 22,000+ GitHub stars — were given access to Cursor Pro with Claude 3.5 and 3.7 Sonnet. The result: AI tool use made them 19% slower on actual task completion time. The same developers estimated they were 20% faster.
This is the core tension in AI-augmented engineering in 2026. The tools feel productive. The measured outcomes are inconsistent. And the gap between perception and measurement is where most engineering leaders are currently making decisions without reliable data.
The 84% of developers who use AI tools (Stack Overflow 2025, 49,000 respondents) are not wrong to use them. But only 29% trust the accuracy of AI outputs, and positive sentiment toward the tools fell from 70%+ in 2023–2024 to 60% in 2025. 66% report that debugging AI-generated code takes more time than expected. The adoption curve and the trust curve are moving in opposite directions.
What the Productivity Data Actually Shows
Faros AI’s telemetry across 10,000+ developers on 1,255 teams captures the clearest aggregate picture. High-AI-adoption teams merge 98% more pull requests. PR review time increases 91%. Average PR size increases 154%. These effects compound: if both figures hold for the same teams, roughly twice as many PRs at about 2.5 times the size is close to five times as much code entering review, and the review burden has grown with it.
This is Amdahl’s Law applied to software delivery: speeding up one stage of the pipeline improves overall throughput only in proportion to the share of total time that stage consumed, and the stages around it become the bottleneck. When AI tools accelerate code writing, the constraint moves to code review, testing, and integration. Teams that have not modernized their review and testing infrastructure alongside their AI tooling are generating more work for the non-accelerated stages than they are saving in the accelerated ones.
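To make the Amdahl's Law point concrete, here is a minimal sketch of the formula. The 30% share of delivery time spent writing code and the 2x coding speedup are hypothetical inputs chosen for illustration, not figures from the Faros data, and the function name is ours.

```python
def overall_speedup(accelerated_fraction: float, stage_speedup: float) -> float:
    """Amdahl's Law: overall speedup when only part of the work gets faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if stage_speedup <= 0:
        raise ValueError("stage_speedup must be positive")
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / stage_speedup)


# Hypothetical split: writing code is 30% of delivery time and becomes 2x faster.
coding_only = overall_speedup(accelerated_fraction=0.30, stage_speedup=2.0)
print(f"coding 2x faster, nothing else changes: {coding_only:.2f}x overall")

# Even an instantaneous coding stage is capped by the untouched 70%.
ceiling = 1.0 / (1.0 - 0.30)
print(f"upper bound with instant coding: {ceiling:.2f}x overall")
```

Under these assumed inputs the overall gain is about 1.18x, with a ceiling near 1.43x. The model also assumes the other stages stay the same speed. If review gets slower as PR volume grows, the real result can be worse than the formula suggests.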
The one metric that consistently improves: onboarding time. Time to 10th pull request for new developers was cut in half from Q1 2024 through Q4 2025 across teams using AI tools. This is the most validated benefit — AI tools help engineers navigate unfamiliar codebases, understand existing patterns, and become productive on new repositories faster. For organizations that hire frequently or rotate engineers across teams, this is a real, measurable return.
What does not reliably improve without additional investment: overall delivery speed for experienced teams on complex tasks, code quality without supplementary review tooling, and security posture.
The Security Problem Nobody Budgeted For
Veracode’s 2025 GenAI Code Security Report tested 100+ LLMs across 80 coding tasks. AI introduced security vulnerabilities in 45% of cases. An ACM empirical study of GitHub Copilot-generated code found 29.5% of Python snippets and 24.2% of JavaScript snippets contained security weaknesses — including XSS and improper input validation.
Checkmarx found that 34% of organizations report 60%+ of their codebase is now AI-generated. The security implication: if the security review process was designed for a codebase where humans wrote most of the code, and 60% of the code is now AI-generated at higher velocity, the review process is running on assumptions that no longer hold.
The specific failure modes in AI-generated code differ from human-generated code in ways that matter for review:
Context blindness. AI models generate code that works for the immediate function without awareness of the broader system’s security model. A query that binds its values as parameters can still be injectable if another part of the statement, such as a column or table name, is assembled from untrusted input elsewhere in the code, outside what the model saw in its context window.
Plausible-looking insecurity. AI-generated code tends to look correct syntactically and logically. The security issues are often subtle — incorrect scoping, overly broad permissions, missing input boundary validation — and do not trigger automated linters the way syntax errors would.
Pattern repetition across the codebase. When many developers use the same AI model for similar tasks, the same vulnerability patterns appear across the codebase. A single mispattern introduced by the model gets replicated at scale before it is caught.
The engineering response is not to stop using AI tools. It is to treat AI-generated code as requiring the same security review as vendor-supplied code — where you do not extend implicit trust based on the author’s intentions.
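The context-blindness case is easier to spot with an example. The sketch below uses Python's standard sqlite3 module with a hypothetical users table and hypothetical function names. The first function binds its value as a parameter and looks safe in isolation, yet it still concatenates a caller-supplied sort column into the SQL text. The second checks that column against an allowlist.

```python
import sqlite3

ALLOWED_SORT_COLUMNS = {"id", "email", "created_at"}  # hypothetical schema


def list_users_unsafe(conn, tenant_id, sort_by):
    # tenant_id is bound as a parameter, so this looks fine in a quick review.
    # sort_by is still concatenated into the SQL text and can carry a subquery.
    query = f"SELECT id, email FROM users WHERE tenant_id = ? ORDER BY {sort_by}"
    return conn.execute(query, (tenant_id,)).fetchall()


def list_users_safe(conn, tenant_id, sort_by):
    # Identifiers cannot be bound as parameters, so validate against an allowlist.
    if sort_by not in ALLOWED_SORT_COLUMNS:
        raise ValueError(f"unsupported sort column: {sort_by!r}")
    query = f"SELECT id, email FROM users WHERE tenant_id = ? ORDER BY {sort_by}"
    return conn.execute(query, (tenant_id,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, tenant_id INTEGER, email TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO users (tenant_id, email, created_at) VALUES (?, ?, ?)",
    [
        (1, "b@example.com", "2025-01-02"),
        (1, "a@example.com", "2025-01-01"),
        (2, "c@example.com", "2025-01-03"),
    ],
)

# The sort order now depends on whether another tenant has rows: a data leak.
payload = "(CASE WHEN (SELECT count(*) FROM users WHERE tenant_id = 2) > 0 THEN email ELSE id END)"
print(list_users_unsafe(conn, 1, payload))

try:
    list_users_safe(conn, 1, payload)
except ValueError as exc:
    print("rejected:", exc)
```

A reviewer who only checks that values are parameterized would pass the first function. Treating the code like vendor-supplied code means asking where every fragment of the statement comes from.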
The Structural Shift: What Senior Engineers Actually Do Now
The bottleneck in AI-assisted development has moved. It is no longer writing code — that is now fast. It is specifying what needs to be written, reviewing what was written, and validating that what was written is correct and secure.
Senior engineers on teams using AI tools are increasingly acting as orchestrators: writing specifications precise enough for AI tools to produce useful output, reviewing AI-generated code at a level of scrutiny that catches the subtle issues above, and designing the architecture that constrains the solution space the AI is working within. The code authorship function — which historically consumed the majority of senior engineering time — is now a smaller fraction of the total work.
This has a specific implication for team composition that is not widely understood: the productivity benefit of AI tools scales with the quality of senior engineering input. A team with weak specification discipline, unclear architecture, and inadequate review processes does not get more productive with AI tools — it gets more of the same problems at higher velocity. The prerequisite for AI-assisted productivity is the engineering discipline that AI tools do not provide.
52% of developers either do not use AI agents or stick to simpler AI tools, and 38% have no plans to adopt AI agents for deployment, monitoring, or project planning. The experienced developers who are most skeptical — who show the lowest trust (2.6% “highly trust”) and highest distrust (20% “highly distrust”) among seniority cohorts — are the developers whose input quality matters most for AI tool output quality. Their skepticism is partly a rational response to having reviewed AI-generated code and found the limitations firsthand.
What Actually Works at the Team Level
The engineering practices that consistently produce better outcomes from AI tooling:
Specification before generation. AI coding tools amplify the quality of the specification they are given. A vague prompt produces plausible-looking code that may be subtly wrong. A specification that defines inputs, outputs, edge cases, error handling, and the security model it must satisfy produces substantially better output — and makes review faster because the reviewer has a clear standard to check against.
AI-specific security scanning in CI. Traditional SAST tools were tuned on patterns found in human-written code. Supplementing with tools specifically calibrated for AI-generated code vulnerabilities (Checkmarx, Snyk Code, Semgrep with AI-specific rulesets) catches the patterns that standard scanners miss. This needs to run on every PR, not periodically.
Size limits on AI-generated PRs. The 154% PR size increase from AI tool adoption makes individual PRs harder to review meaningfully. Teams that impose size limits — and require AI-generated code to be submitted in reviewable increments rather than large complete implementations — consistently report better review quality and faster defect detection.
Evaluation datasets for AI-generated tests. AI tools generate tests readily. The tests are often syntactically correct and pass trivially. Maintaining a baseline of known-difficult test cases — edge conditions, security boundary inputs, integration scenarios — and requiring AI-generated test suites to cover them prevents the false confidence that comes from high test counts with low coverage quality.
Staged AI tool adoption. Full AI tool adoption across a team simultaneously makes it harder to measure what changed and what broke. Teams that stage adoption — expanding usage incrementally while measuring DORA metrics, defect rates, and review cycle times — can identify whether specific practices are working before committing to them at scale.
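Of the practices above, the PR size limit is the simplest to automate. Here is a minimal sketch, assuming the CI job can capture the text produced by `git diff --numstat`. The 400-line threshold and the function names are hypothetical. Pick a limit that matches what your reviewers can realistically read.

```python
def changed_lines(numstat_output: str) -> int:
    """Sum added and deleted lines from `git diff --numstat` text."""
    total = 0
    for line in numstat_output.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        added, deleted = parts[0], parts[1]
        if added == "-" or deleted == "-":
            continue  # numstat reports binary files as "-"
        total += int(added) + int(deleted)
    return total


def within_size_limit(numstat_output: str, max_lines: int = 400) -> bool:
    """Return True when the change is small enough to review in one pass."""
    return changed_lines(numstat_output) <= max_lines


# Sample numstat text for a hypothetical PR: two source files and one image.
sample = (
    "120\t30\tsrc/api/handlers.py\n"
    "-\t-\tdocs/diagram.png\n"
    "310\t5\tsrc/api/models.py\n"
)
print(changed_lines(sample))      # 465
print(within_size_limit(sample))  # False
```

In CI, the same functions could be fed the output of a command such as `git diff --numstat origin/main...HEAD`, failing the check when the limit is exceeded. Generated files and lockfiles would likely need an exclusion list, which this sketch leaves out.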
The Market Reality
The AI code generation market was valued at $4.91 billion in 2024 and is projected to reach $30.1 billion by 2032 at 27.1% CAGR. GitHub Copilot has 15 million users with 90%+ Fortune 100 adoption. The tools are not going away, and the organizations that figure out how to use them effectively will have a structural advantage over those that either adopt uncritically or avoid adoption out of skepticism.
The engineering leadership question is not whether to use AI tools. It is how to build the surrounding infrastructure (specification discipline, review processes, security tooling, measurement) that makes AI tools a net positive rather than a net source of complexity.
DORA added a fifth metric in 2025: Rework Rate, measuring how much engineering activity is reactive rather than planned. Teams where AI tools are increasing code velocity but also increasing the volume of rework — bugs caught late, security issues discovered in production, PRs that fail review and are substantially rewritten — are running faster while going backward on the metric that matters.
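One way to watch for this is to track the share of merged changes that are reactive. The sketch below is an illustration under stated assumptions, not DORA's official calculation: the `reactive` flag is a hypothetical label your team would apply (for example to bug fixes, reverts, and substantial rewrites after failed review), and the before and after numbers are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    pr_id: int
    reactive: bool  # hypothetical label: True for bug fixes, reverts, rewrites


def rework_rate(changes: list[Change]) -> float:
    """Share of merged changes that were reactive rather than planned."""
    if not changes:
        return 0.0
    reactive = sum(1 for change in changes if change.reactive)
    return reactive / len(changes)


# Invented numbers: 40 merged changes before adoption, 6 of them reactive.
before = [Change(pr_id=i, reactive=i < 6) for i in range(40)]
# Invented numbers: 80 merged changes after adoption, 20 of them reactive.
after = [Change(pr_id=i, reactive=i < 20) for i in range(80)]

print(f"before: {rework_rate(before):.2f}")  # 0.15
print(f"after:  {rework_rate(after):.2f}")   # 0.25
```

In this invented example the team merges twice as many changes, but the reactive share rises from 15% to 25%. Merge counts alone would hide that shift.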
How we approach this at Insoftex
With customer approval, we use Claude Code, Cursor, and agentic workflows across our engineering work. The governance model we apply reflects what the research shows: AI tool use is most productive when preceded by specification. We write clear, detailed specs before invoking AI assistance, and we treat AI-generated code as requiring the same review rigor as vendor-supplied code.
For clients building engineering teams or evaluating how to restructure existing teams around AI tooling, the pattern we consistently recommend: invest in senior engineering capacity first, because AI tools amplify senior judgment and are limited by junior judgment. The cost of a senior engineer who can write good specifications and review AI output effectively is justified by the leverage it creates on the rest of the team. The cost of AI tools used without that leverage is absorbed by the downstream stages — review, testing, rework — in ways that often do not show up in simple output metrics.
Evaluating how to structure your engineering team for AI-assisted development — or building a senior-first team from scratch? Our Scale service places senior engineers who understand how to get value from AI tooling without the quality and security risks. Most clients start with a Product Pilot to assess current team structure and capacity needs.