SocioFi Technology

AI-Native Development: Human Verified

How we architect AI-native systems.

Five years of building production AI systems taught us what works, what breaks under load, and what looks good in demos but fails at 3am. This documents the decisions we landed on and why.

Last updated: March 2026 · Version: 2.1 · Status: Living document
Core principles

What we believe about AI system design.

These are not aspirations. They are constraints we impose on every system we build. Violating them is how you end up with a system that works in staging and breaks in production.

01
Reliability over raw performance

A system that returns the right answer in 400ms is better than one that returns a maybe in 80ms. We optimise for correctness first, then throughput. Latency is a product constraint; wrong answers are a product failure.

Applies to every agent call, every tool execution, every output validation.
02
Observability first

Every LLM call is logged with its full input, output, cost, latency, and model version before it reaches production. You cannot debug a system you cannot see. We instrument before we optimise.

Structured logs shipped to a queryable store from day one.
03
Fail loudly, not silently

Silent degradation is the worst failure mode in AI systems. A hallucinated answer that looks correct is more dangerous than a visible error. Our systems raise explicit exceptions on validation failures — they do not patch over them.

Output schema violations trigger immediate retry or escalation, never silent pass-through.
04
Human review at decision boundaries

Autonomous agents are powerful in bounded task spaces. At any decision point that crosses a trust boundary — shipping code, modifying production data, sending external communication — a human approves. Always.

Not optional. Hardcoded into the pipeline.
05
Composable over monolithic

Large language models are not good at multi-step reasoning in a single prompt. We break complex tasks into small, testable, composable units — each with a clear input schema, a clear output schema, and a clear success condition.

A 10-step pipeline of simple agents outperforms one mega-prompt every time.
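A minimal sketch of that principle, with hypothetical stage names: each unit is a plain function with a typed input, a typed output, and an explicit success condition, and the pipeline fails loudly the moment any stage's condition is violated.

```typescript
// Each composable unit: typed input, typed output, explicit success check.
type Stage<I, O> = {
  name: string;
  run: (input: I) => O;
  check: (output: O) => boolean; // the clear success condition
};

// Run two stages in sequence, enforcing each success condition in between.
function runPipeline<A, B, C>(s1: Stage<A, B>, s2: Stage<B, C>, input: A): C {
  const mid = s1.run(input);
  if (!s1.check(mid)) throw new Error(`stage ${s1.name} failed its success condition`);
  const out = s2.run(mid);
  if (!s2.check(out)) throw new Error(`stage ${s2.name} failed its success condition`);
  return out;
}

// Two toy stages standing in for LLM-backed agents.
const draft: Stage<string, string[]> = {
  name: "draft",
  run: (topic) => [`intro: ${topic}`, `body: ${topic}`],
  check: (sections) => sections.length > 0,
};
const join: Stage<string[], string> = {
  name: "join",
  run: (sections) => sections.join("\n"),
  check: (text) => text.length > 0,
};

const result = runPipeline(draft, join, "queues");
// result: "intro: queues\nbody: queues"
```

In a real pipeline each `run` wraps an LLM call and each `check` is a schema parse, but the shape is the same: small units, hard boundaries between them.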
Reference architecture

How the pieces fit together.

A conceptual map of the system layers we use in production AI pipelines. Every real system adapts this to its specific requirements, but the layer structure stays consistent.

SocioFi AI System — Reference Architecture v2.1

  ┌─────────────────────────────────────────────────────────────────────┐
  │                        EXTERNAL INPUTS                              │
  │    User request │ Webhook │ Scheduled job │ API call                │
  └──────────────────────────────┬──────────────────────────────────────┘
                                 │
                                 ▼
  ┌─────────────────────────────────────────────────────────────────────┐
  │                      INPUT VALIDATION LAYER                         │
  │    Schema validation │ Auth check │ Rate limiting │ Sanitisation    │
  └──────────────────────────────┬──────────────────────────────────────┘
                                 │
                                 ▼
  ┌─────────────────────────────────────────────────────────────────────┐
  │                       TASK ROUTER / PLANNER                         │
  │                                                                     │
  │    ┌──────────────┐   ┌──────────────┐   ┌──────────────────────┐  │
  │    │  Task queue  │   │  Task type   │   │   Priority / budget  │  │
  │    │  (BullMQ)    │ → │  classifier  │ → │   enforcement        │  │
  │    └──────────────┘   └──────────────┘   └──────────────────────┘  │
  └──────────────────────────────┬──────────────────────────────────────┘
                                 │
                    ┌────────────┼────────────┐
                    ▼            ▼            ▼
  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
  │ Agent A  │  │ Agent B  │  │ Agent C  │  │ Agent N  │
  │ (Spec)   │  │ (Code)   │  │ (Review) │  │ (Deploy) │
  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘
       │             │             │             │
       └─────────────┴─────────────┴─────────────┘
                                 │
                                 ▼
  ┌─────────────────────────────────────────────────────────────────────┐
  │                      TOOL EXECUTION LAYER                           │
  │                                                                     │
  │    ┌─────────────┐  ┌────────────┐  ┌──────────┐  ┌────────────┐  │
  │    │  Code tools │  │  DB tools  │  │ File I/O │  │  Web tools │  │
  │    │  (run,lint) │  │  (query)   │  │  (read,  │  │  (search,  │  │
  │    └─────────────┘  └────────────┘  │  write)  │  │  fetch)    │  │
  │                                     └──────────┘  └────────────┘  │
  │    All tool calls: retried 3× with exponential backoff             │
  └──────────────────────────────┬──────────────────────────────────────┘
                                 │
                                 ▼
  ┌─────────────────────────────────────────────────────────────────────┐
  │                      MEMORY LAYER                                   │
  │                                                                     │
  │    Working memory    │  Episodic store   │  Semantic index          │
  │    (in-context)      │  (Redis/Postgres) │  (pgvector)              │
  └──────────────────────────────┬──────────────────────────────────────┘
                                 │
                                 ▼
  ┌─────────────────────────────────────────────────────────────────────┐
  │                  OUTPUT VALIDATION LAYER                            │
  │                                                                     │
  │    Zod schema parse → Pass → next stage                            │
  │                      → Fail → retry with error injected (max 3×)   │
  │                      → Fail after 3× → human escalation queue      │
  └──────────────────────────────┬──────────────────────────────────────┘
                                 │
                    ┌────────────┴────────────┐
                    ▼                         ▼
  ┌──────────────────────┐    ┌───────────────────────────────────────┐
  │  HUMAN REVIEW GATE   │    │         OBSERVABILITY LAYER            │
  │                      │    │                                        │
  │  Decision boundary:  │    │  Every LLM call logged:               │
  │  • Ship to prod      │    │  • Input/output (full)                │
  │  • Modify prod DB    │    │  • Cost + latency                     │
  │  • External comms    │    │  • Model + temperature                │
  │  • Auth changes      │    │  • Validation result                  │
  │                      │    │  • Agent + task context               │
  │  Human approves.     │    │                                        │
  │  Always.             │    │  Queryable. Alertable. Auditable.     │
  └──────────────────────┘    └───────────────────────────────────────┘
Technology decisions

What we use and why.

Not recommendations. These are the specific technology choices we made after testing alternatives in production, with the reasoning behind each.

Reasoning layer
Large language models for unstructured reasoning

LLMs excel at natural language understanding, code generation, and ambiguous instruction following. They are not calculators — we use them for what they are good at: interpreting intent and generating structured outputs from fuzzy inputs.

Structured outputs · JSON mode · Tool calling · Temperature: 0.1
Memory layer
Vector databases for semantic retrieval

LLMs have finite context windows. For systems that need to recall information across long sessions or large knowledge bases, we embed information into vectors and retrieve by semantic similarity. We do not stuff context windows.

pgvector · Pinecone · Cosine similarity · Chunking strategies
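The retrieval step reduces to ranking stored chunks by cosine similarity against a query embedding. A dependency-free sketch of that ranking, using tiny toy vectors in place of real model embeddings and an in-memory array in place of pgvector or Pinecone:

```typescript
// Cosine similarity: dot product over the product of magnitudes.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

type Chunk = { text: string; embedding: number[] };

// Rank the store by similarity to the query and keep the top k.
function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return [...chunks]
    .sort((x, y) => cosine(query, y.embedding) - cosine(query, x.embedding))
    .slice(0, k);
}

const store: Chunk[] = [
  { text: "queue retry policy", embedding: [0.9, 0.1, 0.0] },
  { text: "pricing page copy", embedding: [0.0, 0.2, 0.9] },
];
const hits = topK([1, 0, 0], store, 1); // → the "queue retry policy" chunk
```

In production the sort happens inside the vector store (pgvector's cosine distance operator) rather than in application code; the semantics are the same.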
Coordination layer
Message queues for agent orchestration

Multi-agent pipelines need durable, retryable task distribution. We use message queues rather than direct agent-to-agent calls so that failed tasks can be retried, work can be distributed across workers, and the system degrades gracefully under load.

BullMQ · Redis · Dead letter queues · Priority lanes
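The real pipeline runs on BullMQ over Redis; this in-memory sketch shows only the pattern that matters: each task gets a bounded number of attempts with exponential backoff, and exhausted tasks land in a dead-letter queue for a human to inspect. Names and the backoff base are illustrative, not our exact configuration.

```typescript
type Task = { id: string; attempts: number };

// Exponential backoff: 500ms, 1000ms, 2000ms, ...
function backoffMs(attempt: number, baseMs = 500): number {
  return baseMs * 2 ** (attempt - 1);
}

// Attempt a task up to maxAttempts times; record the backoff schedule
// instead of sleeping so the sketch stays deterministic.
function processTask(
  task: Task,
  handler: (t: Task) => boolean, // true = success
  maxAttempts = 3,
  deadLetter: Task[] = []
): { ok: boolean; delays: number[] } {
  const delays: number[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    task.attempts = attempt;
    if (handler(task)) return { ok: true, delays };
    if (attempt < maxAttempts) delays.push(backoffMs(attempt));
  }
  deadLetter.push(task); // exhausted: a human looks at the dead-letter queue
  return { ok: false, delays };
}

// A handler that succeeds on its third attempt.
const dlq: Task[] = [];
const outcome = processTask({ id: "t1", attempts: 0 }, (t) => t.attempts === 3, 3, dlq);
// outcome.ok === true, outcome.delays === [500, 1000], dlq stays empty
```

BullMQ expresses the same policy declaratively via job options (`attempts` and an exponential `backoff`) rather than an explicit loop.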
Output layer
Structured output validation with Zod schemas

Every agent output is validated against a typed schema before it enters the next pipeline stage. Invalid outputs trigger automatic retry with the validation error appended to the prompt. This eliminates an entire class of downstream failures.

Zod · JSON Schema · Auto-retry · Validation errors as prompts
Observability layer
Structured logs for every LLM interaction

Debugging AI systems without logs is guessing. We log every request and response — including cost, latency, model version, and validation outcome — in a structured format queryable by agent, task type, and failure mode.

Structured JSON logs · Cost tracking · Latency percentiles · Error classification
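The shape of such a record, sketched as a TypeScript interface; field names and values here are illustrative, not our exact schema. One JSON object per line is trivially shippable to any queryable store.

```typescript
// One record per LLM call: everything needed to debug, cost, and audit it.
interface LlmCallLog {
  timestamp: string;
  agent: string;
  taskType: string;
  model: string;
  temperature: number;
  input: string;       // full prompt
  output: string;      // full response
  latencyMs: number;
  costUsd: number;
  validation: "pass" | "fail";
}

// Serialise to one JSON line for the log shipper.
function logLlmCall(entry: LlmCallLog): string {
  return JSON.stringify(entry);
}

const line = logLlmCall({
  timestamp: "2026-03-01T03:00:00Z",
  agent: "review",
  taskType: "code_review",
  model: "gpt-4.1",
  temperature: 0.1,
  input: "diff …",
  output: "LGTM with two nits",
  latencyMs: 412,
  costUsd: 0.0031,
  validation: "pass",
});
```

Because every record carries `agent`, `taskType`, and `validation`, queries like "failure rate by agent over the last hour" are one filter, not a log-grepping expedition.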
Deployment layer
Human-gated CI for AI-generated code

AI-generated code does not ship automatically. It enters a review queue where an engineer reads, tests, and approves it. The pipeline automates the generation and testing; the human approves the merge. Speed without trust is a liability.

GitHub Actions · Required reviews · Auto-test on PR · Staging gates
Anti-patterns

What we learned not to do.

Every one of these we either did ourselves or saw break in a system we were brought in to fix. Documenting failures is more useful than documenting successes.

Over-agentic systems

Giving a single agent responsibility for a long, multi-step workflow with external side effects — sending emails, modifying databases, deploying code — without checkpoints. When the agent misunderstands step 2, you find out at step 9 with irreversible consequences.

Break long tasks into short tasks. Add human checkpoints at state-changing boundaries.
Shared mutable state between agents

Multiple agents writing to the same data structure without coordination creates race conditions and conflicting interpretations. We've seen agents overwrite each other's outputs, producing results that reflect neither agent's reasoning correctly.

Agents read shared state; they write to their own output slots. A coordinator merges.
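A sketch of that write-isolation rule, with illustrative names: agents receive read-only shared state, each writes to exactly one slot it owns, and a coordinator decides how slots combine. No agent can clobber another's output.

```typescript
type Shared = Readonly<{ task: string }>;
type Slots = Record<string, unknown>;

// Each agent reads the shared state but owns exactly one output slot.
function runAgents(
  shared: Shared,
  agents: Record<string, (s: Shared) => unknown>
): Slots {
  const slots: Slots = {};
  for (const [name, agent] of Object.entries(agents)) {
    slots[name] = agent(shared);
  }
  return slots;
}

// The coordinator, not the agents, merges the slots.
function merge(slots: Slots): string {
  return Object.entries(slots)
    .map(([name, out]) => `${name}: ${String(out)}`)
    .join("; ");
}

const slots = runAgents({ task: "release notes" }, {
  spec: (s) => `outline for ${s.task}`,
  code: (s) => `draft for ${s.task}`,
});
const mergedOutput = merge(slots);
// "spec: outline for release notes; code: draft for release notes"
```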
No observability in early builds

Skipping logging "to keep things simple" during development means the first production failure is undebuggable. Retrofitting observability into a running AI system is far harder than adding it from the start.

Log every LLM call before you write your first feature. Non-negotiable.
100% automated deployment

Pipelines that generate code and auto-merge PRs without human review. We ran this experiment. AI-generated code that passes tests is still AI-generated code — it can be subtly wrong in ways tests don't catch.

Automate generation and testing. Keep the merge decision with a human.
Prompt-only validation

Telling the LLM to "always return valid JSON" and hoping for the best. Under unusual or adversarial inputs, LLMs eventually ignore formatting instructions. Validation must live in code, not in prompts.

Parse and validate every output in code. Treat prompts as hints, not guarantees.