Research methodology
Our approach to experiments, publishing, and maintaining intellectual honesty in applied AI research. We run structured experiments, document everything, and publish regardless of outcome.
The research cycle
Every experiment we run follows this cycle, from question to publication. The steps are not suggestions: skipping any of them is how labs end up publishing results they cannot reproduce.
A good research question is falsifiable and practically relevant — it must be possible to get a definitive answer, and the answer must matter to real engineering decisions. We do not investigate "how good is AI at X?" We investigate "does approach A outperform approach B on metric M in context C?"
Falsifiable + practically relevant. Both conditions required.
Specific, measurable, tied to a practical outcome. Not "LLMs are good at classification" but "LLMs will achieve >90% accuracy on invoice classification, matching or exceeding our existing rule-based system." The hypothesis defines success and failure before you start.
Write the hypothesis before touching any data.
Control conditions, success metrics, and failure criteria are defined upfront — before any code is written. What constitutes a confirmed hypothesis? What constitutes a failed one? What would make you abandon the experiment early? Document all of this.
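One way to make the upfront design concrete is to encode it as a pre-registration record before any experimental code exists. This is a minimal sketch, not our actual tooling; the field names and thresholds below are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreRegistration:
    """Pre-registered experiment design: success and failure are
    defined before any code is written, so the goalposts cannot
    move after the results come in."""
    hypothesis: str
    metric: str
    confirm_threshold: float  # result at or above this confirms
    fail_threshold: float     # result at or below this denies
    abandon_if: str           # condition for stopping early

    def outcome(self, result: float) -> str:
        # The verdict is mechanical: the thresholds were fixed upfront.
        if result >= self.confirm_threshold:
            return "confirmed"
        if result <= self.fail_threshold:
            return "denied"
        return "inconclusive"

# Example using the invoice-classification hypothesis from the text;
# the failure floor and abandonment condition are invented.
reg = PreRegistration(
    hypothesis="LLM matches or exceeds rule-based invoice classification",
    metric="accuracy",
    confirm_threshold=0.90,
    fail_threshold=0.80,
    abandon_if="labelled sample cannot be assembled within two days",
)
print(reg.outcome(0.93))  # prints "confirmed"
```

Because the record is frozen and written first, "moving the goalposts" would require visibly editing the pre-registration rather than quietly reinterpreting the result.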
Pre-register success criteria. No moving the goalposts.
Minimum viable experiment first. Do not over-engineer the experimental setup before you have signal. A lightweight version that can confirm or deny the hypothesis in two days is more valuable than a comprehensive study that takes three weeks to set up.
Minimum viable experiment. Signal first, fidelity second.
Methods, environment, model versions, prompts, data samples, observations — all recorded during the experiment, not reconstructed afterward. This is where most lab processes fail: documentation that relies on memory is documentation that will be wrong.
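Recording during the experiment can be as simple as appending each observation to a log file the moment it happens. A sketch, assuming a JSON Lines log; the model and prompt identifiers are invented:

```python
import json
from datetime import datetime, timezone

def log_observation(log_path: str, **fields) -> None:
    """Append one timestamped record to a JSON Lines experiment log.

    Records are written the moment an observation is made, never
    reconstructed from memory afterward.
    """
    record = {"ts": datetime.now(timezone.utc).isoformat(), **fields}
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Hypothetical usage during a run.
log_observation(
    "experiment.jsonl",
    step="run",
    model="example-model-2024-06",
    prompt_id="invoice-cls-v3",
    note="accuracy dips on handwritten invoices",
)
```

Appending at observation time makes the log a primary record rather than a reconstruction, which is the property the documentation step depends on.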
Document as you go. Memory is not a research tool.
Results go on the experiment log regardless of outcome. Notable findings — particularly unexpected ones, significant failures, or results that contradict common assumptions — become longer-form articles on the Labs blog. The publication step is not optional.
Publication is part of the experiment, not a postscript.
Failed experiments define the next question. A confirmed hypothesis opens up the next-order question. Abandoned experiments leave a documented record of the dead end so future researchers do not repeat it. Every experiment either confirms, denies, or redirects — all three outcomes are useful.
The cycle ends when we run out of questions. It has not ended yet.
Publication standards
Publication is not an afterthought — it is built into the experimental design. Before we run an experiment, we know what a publishable result looks like.
Reproducibility
A finding that cannot be reproduced is not a finding. Every experiment we design is built with reproducibility as a constraint — not an aspiration.
This means documenting not just what we did, but what environment we did it in, what exact model versions we used, what data we worked with, and what parameters we set. A published experiment should be executable by anyone who reads it.
Where code and datasets are not proprietary, we open-source them. We release tools, prompts, and evaluation frameworks where we can. Research that lives only in a blog post is research that cannot be verified.
Example: environment spec
This is the level of documentation we include in every published experiment:
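A sketch of what such a spec can look like; every value below is an invented placeholder, not a record of a real experiment:

```python
# Illustrative environment spec. All values are placeholders.
ENVIRONMENT_SPEC = {
    "experiment": "invoice-classification-ab",
    "run_date": "2024-06-01",
    "model": {
        "name": "example-model",   # exact model identifier, pinned
        "version": "2024-06-01",   # a dated version, never "latest"
        "temperature": 0.0,
        "max_tokens": 512,
    },
    "data": {
        "sample": "invoices-sample-v2",
        "n_documents": 500,
        "sha256": "placeholder-checksum-of-frozen-sample",
    },
    "prompts": "prompts/invoice_cls_v3/",
    "runtime": {"python": "3.11", "seed": 42},
}
```

The point is not the exact schema but that model versions, data checksums, parameters, and seeds are all pinned, so a reader can re-execute the experiment rather than approximate it.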
Research ethics
Applied AI research is young and the norms are still forming. We operate by principles that we think should be the standard — even when they are not convenient.
Results are reported in full. If an experiment produced mixed or contradictory results, we publish the full distribution — not the subset that supports the hypothesis. An honest negative result is worth more than a misleading positive one.
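Reporting in full can be as mechanical as publishing summary statistics over every run instead of the best one. A sketch with invented numbers:

```python
import statistics

# Hypothetical accuracies from five runs of the same experiment.
runs = [0.91, 0.88, 0.93, 0.79, 0.90]

# Cherry-picking would report max(runs) alone. Reporting in full
# means publishing the whole distribution, outlier included.
report = {
    "n_runs": len(runs),
    "mean": round(statistics.mean(runs), 3),
    "stdev": round(statistics.stdev(runs), 3),
    "min": min(runs),
    "max": max(runs),
    "all_runs": runs,
}
print(report)
```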
Every experiment has scope constraints — sample sizes, controlled conditions, specific model versions, particular document types. We state these limitations explicitly. Findings that are true in our test conditions may not generalize; we say so.
A finding is what the data showed. A recommendation is what we think practitioners should do based on that data. These are different things. We keep them clearly separated in all published work — a finding becomes a recommendation only after explicit reasoning.
When our findings confirm or contradict existing research or published work from others, we say so. We are not the first people to run experiments in applied AI. Situating our work in the broader landscape makes it more useful, not less.
See it in practice
Every experiment we have run — completed, failed, and abandoned — is published in the experiment log. Methodology in action.