Research methodology
Our approach to experiments, publishing, and maintaining intellectual honesty in applied AI research. We run structured experiments, document everything, and publish regardless of outcome.
The research cycle
Every experiment we run follows this cycle, from question to publication. The steps are not suggestions: skipping any of them is how labs end up publishing results they cannot reproduce.
A good research question is falsifiable and practically relevant — it must be possible to get a definitive answer, and the answer must matter to real engineering decisions. We do not investigate "how good is AI at X?" We investigate "does approach A outperform approach B on metric M in context C?"
Falsifiable + practically relevant. Both conditions required.
Specific, measurable, tied to a practical outcome. Not "LLMs are good at classification" but "LLMs will achieve >90% accuracy on invoice classification, matching or exceeding our existing rule-based system." The hypothesis defines success and failure before you start.
Write the hypothesis before touching any data.
Control conditions, success metrics, and failure criteria are defined upfront — before any code is written. What constitutes a confirmed hypothesis? What constitutes a failed one? What would make you abandon the experiment early? Document all of this.
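One way to make the upfront design concrete is to encode it as a pre-registration record before any experimental code exists. This is a minimal sketch, not our actual tooling; the field names and thresholds below are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreRegistration:
    """Pre-registered experiment design: success and failure are
    defined before any code is written, so the goalposts cannot
    move after the results come in."""
    hypothesis: str
    metric: str
    confirm_threshold: float  # result at or above this confirms
    fail_threshold: float     # result at or below this denies
    abandon_if: str           # condition for stopping early

    def outcome(self, result: float) -> str:
        # The verdict is mechanical: the thresholds were fixed upfront.
        if result >= self.confirm_threshold:
            return "confirmed"
        if result <= self.fail_threshold:
            return "denied"
        return "inconclusive"

# Example using the invoice-classification hypothesis from the text;
# the failure floor and abandonment condition are invented.
reg = PreRegistration(
    hypothesis="LLM matches or exceeds rule-based invoice classification",
    metric="accuracy",
    confirm_threshold=0.90,
    fail_threshold=0.80,
    abandon_if="labelled sample cannot be assembled within two days",
)
print(reg.outcome(0.93))  # prints "confirmed"
```

Because the record is frozen and written first, "moving the goalposts" would require visibly editing the pre-registration rather than quietly reinterpreting the result.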
Pre-register success criteria. No moving the goalposts.
Minimum viable experiment first. Do not over-engineer the experimental setup before you have signal. A lightweight version that can confirm or deny the hypothesis in two days is more valuable than a comprehensive study that takes three weeks to set up.
Minimum viable experiment. Signal first, fidelity second.
Methods, environment, model versions, prompts, data samples, observations — all recorded during the experiment, not reconstructed afterward. This is where most lab processes fail: documentation that relies on memory is documentation that will be wrong.
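Recording during the experiment can be as simple as appending each observation to a log file the moment it happens. A sketch, assuming a JSON Lines log; the model and prompt identifiers are invented:

```python
import json
from datetime import datetime, timezone

def log_observation(log_path: str, **fields) -> None:
    """Append one timestamped record to a JSON Lines experiment log.

    Records are written the moment an observation is made, never
    reconstructed from memory afterward.
    """
    record = {"ts": datetime.now(timezone.utc).isoformat(), **fields}
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Hypothetical usage during a run.
log_observation(
    "experiment.jsonl",
    step="run",
    model="example-model-2024-06",
    prompt_id="invoice-cls-v3",
    note="accuracy dips on handwritten invoices",
)
```

Appending at observation time makes the log a primary record rather than a reconstruction, which is the property the documentation step depends on.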
Document as you go. Memory is not a research tool.
Results go on the experiment log regardless of outcome. Notable findings — particularly unexpected ones, significant failures, or results that contradict common assumptions — become longer-form articles on the Labs blog. The publication step is not optional.
Publication is part of the experiment, not a postscript.
Failed experiments define the next question. A confirmed hypothesis opens up the next-order question. Abandoned experiments leave a documented record of the dead end so future researchers do not repeat it. Every experiment either confirms, denies, or redirects — all three outcomes are useful.
The cycle ends when we run out of questions. It has not ended yet.
Publication standards
Publication is not an afterthought — it is built into the experimental design. Before we run an experiment, we know what a publishable result looks like.
Reproducibility
A finding that cannot be reproduced is not a finding. Every experiment we design is built with reproducibility as a constraint — not an aspiration.
This means documenting not just what we did, but what environment we did it in, what exact model versions we used, what data we worked with, and what parameters we set. A published experiment should be executable by anyone who reads it.
Where code and datasets are not proprietary, we open-source them. We release tools, prompts, and evaluation frameworks where we can. Research that lives only in a blog post is research that cannot be verified.
Example: environment spec
This is the level of documentation we include in every published experiment:
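A sketch of what such a spec can look like; every value below is an invented placeholder, not a record of a real experiment:

```python
# Illustrative environment spec. All values are placeholders.
ENVIRONMENT_SPEC = {
    "experiment": "invoice-classification-ab",
    "run_date": "2024-06-01",
    "model": {
        "name": "example-model",   # exact model identifier, pinned
        "version": "2024-06-01",   # a dated version, never "latest"
        "temperature": 0.0,
        "max_tokens": 512,
    },
    "data": {
        "sample": "invoices-sample-v2",
        "n_documents": 500,
        "sha256": "placeholder-checksum-of-frozen-sample",
    },
    "prompts": "prompts/invoice_cls_v3/",
    "runtime": {"python": "3.11", "seed": 42},
}
```

The point is not the exact schema but that model versions, data checksums, parameters, and seeds are all pinned, so a reader can re-execute the experiment rather than approximate it.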
Research ethics
Applied AI research is young and the norms are still forming. We operate by principles that we think should be the standard — even when they are not convenient.
Results are reported in full. If an experiment produced mixed or contradictory results, we publish the full distribution — not the subset that supports the hypothesis. An honest negative result is worth more than a misleading positive one.
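Reporting in full can be as mechanical as publishing summary statistics over every run instead of the best one. A sketch with invented numbers:

```python
import statistics

# Hypothetical accuracies from five runs of the same experiment.
runs = [0.91, 0.88, 0.93, 0.79, 0.90]

# Cherry-picking would report max(runs) alone. Reporting in full
# means publishing the whole distribution, outlier included.
report = {
    "n_runs": len(runs),
    "mean": round(statistics.mean(runs), 3),
    "stdev": round(statistics.stdev(runs), 3),
    "min": min(runs),
    "max": max(runs),
    "all_runs": runs,
}
print(report)
```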
Every experiment has scope constraints — sample sizes, controlled conditions, specific model versions, particular document types. We state these limitations explicitly. Findings that are true in our test conditions may not generalize; we say so.
A finding is what the data showed. A recommendation is what we think practitioners should do based on that data. These are different things. We keep them clearly separated in all published work — a finding becomes a recommendation only after explicit reasoning.
When our findings confirm or contradict existing research or published work from others, we say so. We are not the first people to run experiments in applied AI. Situating our work in the broader landscape makes it more useful, not less.
See it in practice
Every experiment we have run — completed, failed, and abandoned — is published in the experiment log. Methodology in action.