SocioFi
Technology

AI-Native Development: Human Verified

Labs · Benchmarks

What AI can actually do. Numbers, not marketing.

We publish our benchmark results publicly because we think the industry needs more honesty about what AI development tools can and cannot do. These are real numbers from real workloads, updated quarterly.

Updated Q1 2026 · Real workloads · Blind evaluation · Open methodology
Results

Five benchmark categories.

Each benchmark measures a specific capability against labelled data from real Studio projects. Methodology notes follow each card.

Q1 2026
Code generation
Spec-to-code accuracy
78%
first-pass accuracy on medium complexity features
Simple features: 94%
High complexity: 52%
Sample size: 840 tasks
Measured on features with written acceptance criteria. "Accuracy" means the generated code passes all acceptance criteria on first attempt without human modification. Complex features involving multi-service integrations score lower.
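As a sketch of how a first-pass figure like this could be tallied, here is a minimal Python helper over hypothetical task records, grouped by complexity band. The `tasks` data and `first_pass_accuracy` name are illustrative, not the Studio's actual harness:

```python
from collections import defaultdict

# Hypothetical task records: (complexity_band, passed_all_criteria_first_try)
tasks = [
    ("simple", True), ("simple", True), ("simple", False),
    ("medium", True), ("medium", False),
    ("high", False), ("high", True),
]

def first_pass_accuracy(records):
    """Fraction of tasks whose generated code passed every acceptance
    criterion on the first attempt, grouped by complexity band."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for band, ok in records:
        total[band] += 1
        passed[band] += ok  # bool counts as 0 or 1
    return {band: passed[band] / total[band] for band in total}

print(first_pass_accuracy(tasks))
```

Note that "passed on first attempt" is a strict pass/fail per task; partial credit would need a different metric.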
Q4 2025
Security
Security issue detection rate
85%
detection rate on known vulnerability patterns
False positive rate: 12%
Novel patterns: 41%
Sample size: 320 scans
Tested against a labelled set of known vulnerability patterns (OWASP Top 10 + SANS 25). Detection is strong for known patterns; novel attack vectors absent from training data are detected significantly less often (41%).
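Detection rate and false-positive rate on a labelled set reduce to standard confusion-matrix arithmetic. A minimal sketch, assuming each scan result is a (flagged, actually vulnerable) pair; the data and function names are hypothetical:

```python
# Hypothetical scan results against a labelled vulnerability set.
# Each entry: (flagged_by_tool, actually_vulnerable)
scans = [
    (True, True), (True, True), (False, True),   # 2 of 3 real issues caught
    (True, False),                               # 1 false alarm
    (False, False), (False, False),              # clean code left alone
]

def detection_metrics(results):
    tp = sum(1 for flagged, real in results if flagged and real)
    fn = sum(1 for flagged, real in results if not flagged and real)
    fp = sum(1 for flagged, real in results if flagged and not real)
    tn = sum(1 for flagged, real in results if not flagged and not real)
    return {
        "detection_rate": tp / (tp + fn),        # recall on labelled issues
        "false_positive_rate": fp / (fp + tn),   # alarms raised on clean code
    }

print(detection_metrics(scans))
```

The two rates trade off against each other: tuning a scanner to flag more aggressively raises detection but also raises false positives.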
Q1 2026
Testing
AI-generated test coverage
94%
average line coverage on newly generated code
Edge case coverage: 71%
Integration tests: 68%
Sample size: 1,200 modules
Coverage is high on the happy path and common error cases. Edge cases that require deep domain knowledge — unusual state combinations, race conditions, subtle business rules — still require human-authored tests.
Q4 2025
Documentation
Documentation completeness
89%
alignment between docs and actual code behaviour
API docs: 96%
Architecture docs: 74%
Sample size: 480 modules
Alignment measured by asking engineers to use only the documentation to implement usage examples, then checking if examples work. API-level documentation scores higher than architecture-level, which requires understanding of intent.
Q1 2026
DevOps
Deployment pipeline reliability
99.2%
successful deployments without human intervention
MTTR on failure: < 12 min
Rollback success: 100%
Sample size: 2,400 deploys
Measured across all Studio client projects over 18 months. The 0.8% of failures fall into three categories: environment-specific configuration drift, external service outages, and manual override errors. AI-generated infrastructure code has a lower failure rate than the human-written equivalent.
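Success rate and MTTR can be derived from a deploy log in a few lines. A sketch over hypothetical data, where recovery time is recorded only for failed deploys; none of these names come from the Studio's tooling:

```python
# Hypothetical deploy log: (succeeded_without_intervention, minutes_to_recover)
# minutes_to_recover is None for successful deploys.
deploys = [
    (True, None), (True, None), (True, None), (True, None),
    (False, 8), (False, 14),
]

def pipeline_stats(log):
    """Success rate plus mean time to recovery (MTTR) over failed deploys."""
    failures = [mins for ok, mins in log if not ok]
    success_rate = (len(log) - len(failures)) / len(log)
    mttr = sum(failures) / len(failures) if failures else 0.0
    return {"success_rate": success_rate, "mttr_minutes": mttr}

print(pipeline_stats(deploys))
```

MTTR is averaged over failures only, which is why a "< 12 min" figure can coexist with a 99.2% success rate: the denominator is the small failing fraction.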
Methodology

How we measure.

Benchmarks are only as useful as their methodology is honest. Here is exactly how these numbers are collected and what "updated quarterly" means in practice.

01
Real workloads only

All benchmarks use real tasks from Studio client projects, anonymised and with client permission. Synthetic test cases are used only to supplement, never as the primary measurement.

02
Blind evaluation

Where possible, evaluation is done by an engineer who does not know whether the output was AI-generated or human-written. This removes confirmation bias from quality assessments.
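Blinding of this kind amounts to stripping provenance before review while keeping a private key for later un-blinding. A minimal sketch, with entirely hypothetical sample data and function names:

```python
import random

# Hypothetical review items: provenance ("ai" or "human") must be hidden
# from the evaluating engineer.
samples = [
    {"id": 1, "source": "ai", "output": "def add(a, b): return a + b"},
    {"id": 2, "source": "human", "output": "def sub(a, b): return a - b"},
]

def build_blind_queue(items, seed=0):
    """Shuffle items and strip the 'source' field; return the blinded
    queue plus a private id->source key for un-blinding after review."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    key = {item["id"]: item["source"] for item in shuffled}  # kept private
    queue = [{"id": item["id"], "output": item["output"]} for item in shuffled]
    return queue, key

queue, key = build_blind_queue(samples)
assert all("source" not in item for item in queue)  # reviewer never sees origin
```

The fixed seed makes the shuffle reproducible for auditing; the key stays with whoever administers the evaluation, not the reviewer.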

03
Quarterly update cycle

Numbers are recalculated every quarter with fresh data. Published benchmarks include the measurement date. We do not retroactively improve historical figures.

04
Failure classification

Every failure is classified by type — model error, prompt design, tool failure, evaluation error. This lets us attribute improvements to their actual causes.
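A failure taxonomy like this is easy to enforce in code: reject labels outside the agreed set, then tally. A sketch with hypothetical labels; the category names are taken from the list above:

```python
from collections import Counter

# The four failure categories named in the methodology above.
FAILURE_TYPES = {"model_error", "prompt_design", "tool_failure", "evaluation_error"}

def classify_failures(labels):
    """Tally failures by cause, rejecting labels outside the taxonomy
    so every failure is attributed to exactly one known category."""
    for label in labels:
        if label not in FAILURE_TYPES:
            raise ValueError(f"unknown failure type: {label}")
    return Counter(labels)

tally = classify_failures(
    ["model_error", "model_error", "prompt_design", "tool_failure"]
)
print(tally.most_common(1))  # the most frequent cause
```

Rejecting unknown labels at ingestion is what makes quarter-over-quarter attribution trustworthy: a category cannot silently appear or drift.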

On quarterly updates: Numbers are recalculated in January, April, July, and October. Each published figure is timestamped. If a model update significantly changes performance between quarterly cycles, we publish an interim update and note the cause. We do not retroactively edit historical benchmark records.
Limitations

What these numbers do not tell you.

We believe documenting limitations is as important as documenting results. Read these before drawing conclusions from the numbers above.

These are our numbers on our workloads

Benchmark results vary significantly by codebase maturity, domain, and task complexity. Do not extrapolate these numbers to your specific project without understanding the measurement conditions.

Coverage ≠ correctness

High test coverage does not mean the tests are testing the right things. Our 94% coverage figure measures line coverage, not semantic correctness of the test assertions.
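The gap between line coverage and semantic correctness is easy to demonstrate. In this hypothetical example, the first test executes every line of the function, so line coverage reads 100%, yet it would pass even if the logic were wrong:

```python
def apply_discount(price, percent):
    return price * (1 - percent / 100)

# This "test" executes every line of apply_discount, so line coverage
# is 100% -- but the assertion is vacuous and would pass even if the
# discount logic were completely wrong.
def test_apply_discount_vacuous():
    result = apply_discount(100, 20)
    assert result is not None  # always true; checks nothing about the value

# A meaningful test pins down the actual behaviour:
def test_apply_discount_correct():
    assert apply_discount(100, 20) == 80.0

test_apply_discount_vacuous()
test_apply_discount_correct()
```

Mutation testing, which checks whether tests fail when the code is deliberately broken, is one way to measure assertion quality that line coverage misses.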

Complexity is a strong mediating variable

Every metric has a strong complexity dependency. Simple tasks score significantly higher. Our reported numbers are averages across complexity bands — the distribution is wide.
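How a wide distribution collapses into one headline number is just a sample-weighted average. A sketch using the code-generation bands quoted above (94% / 78% / 52%) with hypothetical per-band task counts:

```python
# Per-band (accuracy_rate, task_count); rates echo the code-generation
# benchmark above, counts are illustrative.
bands = {"simple": (0.94, 300), "medium": (0.78, 340), "high": (0.52, 200)}

def weighted_average(results):
    """Task-count-weighted mean accuracy across complexity bands."""
    total = sum(n for _, n in results.values())
    return sum(rate * n for rate, n in results.values()) / total

avg = weighted_average(bands)
# The headline average sits between the band extremes and hides the spread.
assert bands["high"][0] < avg < bands["simple"][0]
```

Shifting the task mix toward simpler work raises the headline number without any change in capability, which is exactly why the measurement conditions matter.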

Models change frequently

AI model capabilities improve (and occasionally regress) with new versions. Our benchmarks reflect the model versions in use at measurement time, noted in each benchmark record.

Where AI still struggles
- Multi-file refactoring: Coordinated changes across many files with complex interdependencies — especially refactors that require understanding implicit contracts between modules.
- Performance debugging: Identifying non-obvious bottlenecks, especially in concurrent systems where the problem is an emergent property of multiple components interacting.
- Security in novel attack surfaces: Vulnerability patterns that do not appear in training data. AI security tools excel at known patterns, not novel ones.
- Long-horizon planning: Architectural decisions that require predicting how a system will need to evolve over 12–18 months. Short-context reasoning performs well; long-range planning still needs humans.
- Legacy codebase understanding: Code with undocumented conventions, implicit state, and historical decisions not captured anywhere. The more context is tacit, the worse AI performs.
Comparison

AI-native vs. traditional development.

A direct comparison across time, cost, and quality dimensions. We include dimensions where traditional development wins — because pretending otherwise helps nobody.

Dimension | Traditional development | AI-native pipeline
Time to first working prototype | 4–8 weeks | 5–10 days
Code generation throughput | ~400 lines/day | 2,000–5,000 lines/day
Test coverage on new code | 60–75% | 88–96%
Documentation completeness | 40–60% | 85–92%
Cost per feature (small) | $800–$2,000 | $200–$600
Architectural decision quality | High (with senior devs) | Medium (human oversight needed)
Novel problem-solving | High | Medium (pattern-dependent)
Security audit depth | High (with specialists) | Medium (known patterns only)
Long-term maintenance cost | Variable | Lower (consistent patterns)
Deployment reliability | 94–98% | 98.5–99.5%

Traditional development figures based on industry surveys (Stack Overflow Developer Survey 2025, GitLab DevSecOps Report 2025) and our own experience running hybrid projects. AI-native figures are our internal measurements. Both assume competent practitioners.