What does AI actually do well — and where does it fail — in real production software?
The gap between AI capability in research papers and AI capability inside real production software is larger than most practitioners expect. Models that score impressively on academic benchmarks perform very differently on real documents, real queries, and real user behaviour.
This stream investigates practical AI integration — not what models can theoretically do, but what they reliably do when connected to real data, exposed to real inputs, and asked to operate without a human in the loop to catch mistakes. We focus on document intelligence, structured output extraction, retrieval-augmented generation, and autonomous workflow execution.
The most valuable findings in this stream are the negative ones: places where AI fails in ways that look like success. A model that returns plausible-looking JSON that silently violates the schema. A RAG system that retrieves confidently but retrieves the wrong document. A prompt that worked last month and produces subtly worse output today after a model update nobody announced.
We publish these failure modes because they are predictable, reproducible, and fixable — but only if you know to look for them. The goal is to save practitioners from discovering them the hard way in production.
We ran the same question set against three different models connected to the same retrieval system, then ran the same models against retrieval systems of varying quality. Retrieval quality explained 3x more variance in answer accuracy than model capability. Investing in retrieval quality first is the correct order of priorities for most RAG deployments.
We tested 3 models × 4 retrieval configurations on a 200-question evaluation set with ground-truth answers from a real client knowledge base, measuring exact match, semantic similarity, and factual grounding rate.
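The evaluation grid can be sketched as a cross of models and retrieval configurations scored on the same question set, so variance can be attributed to each axis independently. This is a minimal illustration, not our production harness; the model and retriever interfaces here are hypothetical stand-ins.

```python
from itertools import product

def exact_match(pred: str, truth: str) -> float:
    return float(pred.strip().lower() == truth.strip().lower())

def evaluate_grid(models, retrievers, questions):
    """Score every model x retriever pair on the same question set."""
    results = {}
    for model, retriever in product(models, retrievers):
        scores = []
        for q in questions:
            context = retriever["retrieve"](q["question"])
            answer = model["answer"](q["question"], context)
            scores.append(exact_match(answer, q["truth"]))
        results[(model["name"], retriever["name"])] = sum(scores) / len(scores)
    return results
```

Holding the question set fixed across every cell is what makes the variance decomposition between retrieval quality and model capability possible.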
Across 12 production structured extraction pipelines, an average of 11% of model outputs contained schema violations that were not surfaced as errors — they returned as syntactically valid JSON that violated semantic constraints (wrong enum values, plausible-but-incorrect field types, missing required conditional fields). These failures were invisible without explicit validation.
We seeded 40 known schema violation patterns into test documents across varying complexity levels and measured what percentage of model outputs violated the schema without triggering a parsing error.
We maintain a regression suite of 50 stable prompts with expected outputs. Running this suite after 6 model updates over 8 months detected measurable output drift in 4 of 6 updates — in 2 cases, the drift was large enough to break downstream logic in production systems. Model updates are not announced for self-hosted models and are inconsistently documented for hosted APIs.
We run an automated regression suite against a prompt bank with expected outputs, scored by exact match plus semantic similarity. The threshold for "drift detected" is a 5% or greater change in either metric.
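The scoring loop looks roughly like this. Token overlap stands in for semantic similarity here to keep the sketch self-contained; in practice an embedding-based similarity would be used.

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a stand-in for embedding similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def score_suite(prompts, generate, expected):
    """Mean exact match and mean similarity of outputs vs expected outputs."""
    exact = sum(generate(p) == expected[p] for p in prompts) / len(prompts)
    sim = sum(token_overlap(generate(p), expected[p]) for p in prompts) / len(prompts)
    return exact, sim

def drift_detected(current, baseline, threshold=0.05):
    """Flag drift when either metric moves by 5 points or more vs baseline."""
    return any(abs(c - b) >= threshold for c, b in zip(current, baseline))
```

Rerunning `score_suite` after every model update and comparing against the stored baseline is the whole mechanism; the suite itself never changes.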
Across 5 production systems using self-consistency checking (running the same query multiple times and measuring agreement), confident hallucinations — cases where the model provided a definitive wrong answer — dropped by 83% on average. The cost increase of 40–60% is justified for high-stakes outputs but not for routine information retrieval.
We ran A/B tests on production query traffic, splitting requests between standard single-pass inference and self-consistency checking (5 independent samples). Confident hallucinations were identified by cross-referencing outputs against verified source documents.
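Self-consistency checking reduces to a small voting loop. The 0.6 agreement threshold below is illustrative, not the value used in the production systems described.

```python
from collections import Counter

def self_consistent(sample, query, n=5, min_agreement=0.6):
    """Draw n independent samples and keep the majority answer only when
    agreement clears the threshold; otherwise abstain (return None)."""
    votes = Counter(sample(query) for _ in range(n))
    answer, count = votes.most_common(1)[0]
    agreement = count / n
    return (answer if agreement >= min_agreement else None), agreement
```

The abstention path is where the 40-60% cost increase buys its value: disagreement across samples is the signal that a confident single-pass answer would likely have been a hallucination.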
We test against real document corpora from Studio projects (anonymised) and purpose-built adversarial datasets. No synthetic benchmarks — we test only conditions that reflect how these systems behave in production.
Primary metrics are accuracy (against ground truth), false confidence rate (wrong answer presented with high confidence), and silent failure rate (technically valid output that violates intended semantics). We track these over time to detect regression after model updates.
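The three metrics can be computed from per-record flags; the record shape here is a hypothetical simplification of our evaluation logs.

```python
def tracked_metrics(records):
    """records: dicts flagging whether each output was correct, stated
    confidently, parsed successfully, and was semantically valid."""
    n = len(records)
    return {
        "accuracy": sum(r["correct"] for r in records) / n,
        "false_confidence_rate": sum(
            r["confident"] and not r["correct"] for r in records) / n,
        "silent_failure_rate": sum(
            r["parses"] and not r["semantically_valid"] for r in records) / n,
    }
```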
We document model versions, prompt versions, and dataset characteristics for every experiment. When we discover that a finding changes after a model update, we publish an update to the original finding.
Hypothesis: Routing queries to the most semantically relevant data source before retrieval will improve answer accuracy more than improving retrieval within a single source.
Result: Answer relevance improved 38% on a 200-question evaluation set, with no increase in latency.
Query routing is now a first-class step in every RAG pipeline we build. Routing based on query intent (procedural vs. factual vs. comparative) outperformed routing based on semantic similarity to source descriptions alone.
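Intent-based routing can be sketched as a classification step ahead of retrieval. The keyword heuristic and source names below are hypothetical stand-ins; a production router would use a trained intent classifier.

```python
def classify_intent(query: str) -> str:
    """Crude keyword heuristic; production routing would use a classifier."""
    q = query.lower()
    tokens = q.split()
    if "vs" in tokens or "versus" in tokens or "difference between" in q:
        return "comparative"
    if q.startswith(("how do i", "how to")) or "steps to" in q:
        return "procedural"
    return "factual"

ROUTES = {  # hypothetical data-source names
    "procedural": "runbooks",
    "comparative": "product_matrix",
    "factual": "knowledge_base",
}

def route(query: str) -> str:
    return ROUTES[classify_intent(query)]
```

The point of the finding is the routing key: classifying *what kind of answer the query wants* outperformed matching the query's embedding against source descriptions.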
Hypothesis: A well-prompted model can classify document types (invoice, contract, report, letter) with >90% accuracy without fine-tuning on domain-specific examples.
Result: 94% accuracy on structured document types; 61% on free-form layouts without consistent visual structure.
Zero-shot classification is reliable for templated documents and unreliable for free-form ones. We now use classification confidence to route low-confidence documents to a human-in-the-loop step before downstream processing.
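The confidence-gated routing step amounts to a simple partition; the 0.85 threshold here is illustrative.

```python
from dataclasses import dataclass

@dataclass
class Classification:
    doc_id: str
    label: str          # e.g. "invoice", "contract", "report", "letter"
    confidence: float

def partition(classifications, threshold=0.85):
    """Split classifier output into auto-processed documents and a
    human-review queue; low confidence never reaches downstream steps."""
    auto, review = [], []
    for c in classifications:
        (auto if c.confidence >= threshold else review).append(c)
    return auto, review
```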
Hypothesis: An AI pipeline can generate weekly operational reports from structured data sources with quality indistinguishable from human-written reports.
Result: Quantitative sections were rated equivalent to human-written reports. Narrative interpretation sections were rated significantly lower on coherence and appropriate emphasis.
AI is reliable for data-to-text generation when the interpretation rules are explicit. Qualitative analysis that requires judgment about what matters requires human involvement. We now use AI for the data synthesis layer and human editors for the interpretation layer.
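The synthesis layer can be as simple as deterministic sentences generated under explicit rules, with emphasis and interpretation left to a human editor. The metric names and phrasing below are hypothetical.

```python
def synthesize(current: dict, previous: dict) -> list[str]:
    """Data-to-text under explicit rules: one deterministic sentence per
    metric. Deciding which changes deserve emphasis stays with a human."""
    lines = []
    for name, value in sorted(current.items()):
        prev = previous.get(name)
        if not prev:
            lines.append(f"{name}: {value} (no prior-period baseline)")
            continue
        change = (value - prev) / prev * 100
        direction = "up" if change > 0 else "down" if change < 0 else "flat"
        lines.append(f"{name}: {value} ({direction} {abs(change):.1f}% vs prior period)")
    return lines
```

Because every sentence is a pure function of the data, this layer cannot hallucinate; anything requiring judgment about what matters is deliberately excluded from it.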