We build things for our own work, then release them. No strings attached: everything is MIT or Apache licensed. These tools run in our own production systems before anyone else touches them.
Standardized benchmarks for testing multi-agent coordination, tool use reliability, and failure recovery. 47 built-in test scenarios covering common failure modes in production agent systems.
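A failure-recovery scenario like the ones described above can be sketched in a few lines. This is a conceptual illustration, not the tool's actual API: the `FlakyTool` and `call_with_retry` names are hypothetical, standing in for a simulated transient outage and the agent's retry policy under test.

```python
class FlakyTool:
    """Hypothetical tool that fails its first `fail_first` calls,
    simulating a transient outage (a common production failure mode)."""
    def __init__(self, fail_first=2):
        self.calls = 0
        self.fail_first = fail_first

    def __call__(self, query):
        self.calls += 1
        if self.calls <= self.fail_first:
            raise TimeoutError("simulated transient failure")
        return f"results for {query!r}"

def call_with_retry(tool, query, max_attempts=3):
    """Minimal retry policy an agent under test might implement."""
    for attempt in range(1, max_attempts + 1):
        try:
            return tool(query)
        except TimeoutError:
            if attempt == max_attempts:
                raise

# The scenario passes if the agent recovers within its retry budget.
tool = FlakyTool(fail_first=2)
result = call_with_retry(tool, "status report")
```

A scenario like this fails loudly when an agent gives up too early or retries forever, which is exactly the kind of behavior worth pinning down before production.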
Detects and blocks prompt injection attempts with <2ms latency. 94% accuracy on our benchmark dataset. Supports custom allowlists and configurable sensitivity thresholds for different deployment contexts.
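The allowlist-plus-threshold design can be sketched as follows. This is a toy pattern matcher for illustration only; the pattern list, weights, and function names are hypothetical, and the real detector's accuracy comes from more than regex matching.

```python
import re

# Hypothetical patterns and risk weights; a production detector would
# rely on a trained model rather than a static list like this.
INJECTION_PATTERNS = [
    (r"ignore (all )?previous instructions", 0.9),
    (r"you are now (in )?developer mode", 0.8),
    (r"reveal your system prompt", 0.7),
]

def injection_score(text, allowlist=()):
    """Return the highest matched risk score, skipping allowlisted phrases."""
    lowered = text.lower()
    for phrase in allowlist:
        lowered = lowered.replace(phrase.lower(), "")
    score = 0.0
    for pattern, weight in INJECTION_PATTERNS:
        if re.search(pattern, lowered):
            score = max(score, weight)
    return score

def is_blocked(text, threshold=0.75, allowlist=()):
    """Sensitivity threshold: lower it for stricter deployment contexts."""
    return injection_score(text, allowlist) >= threshold
```

The threshold is the knob that trades false positives for false negatives per deployment context; the allowlist carves out phrases that are legitimate in a given domain.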
Compare retrieval strategies, chunking methods, and embedding models on standardized datasets. Used by 200+ teams to evaluate RAG configurations before production deployment.
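What "comparing chunking methods on a standardized dataset" means in practice can be shown with a deliberately tiny sketch. Everything here (the two strategies, the toy term-overlap retrieval score) is hypothetical and simplified; a real benchmark would use embedding similarity and labeled relevance judgments.

```python
def fixed_chunks(text, size=40):
    """Fixed-size character chunking."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def sentence_chunks(text):
    """Sentence-boundary chunking."""
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def hits_in_top_k(chunks, query_terms, k=2):
    """Toy retrieval: rank chunks by term overlap, then check whether
    the top-k chunks together cover every query term."""
    ranked = sorted(chunks, key=lambda c: -sum(t in c.lower() for t in query_terms))
    top = " ".join(ranked[:k]).lower()
    return all(t in top for t in query_terms)

doc = ("Invoices are processed nightly. Purchase orders require approval. "
       "Contracts are reviewed by legal before signing.")
query = ["contracts", "legal"]

for name, strategy in [("fixed-40", fixed_chunks), ("sentence", sentence_chunks)]:
    print(name, hits_in_top_k(strategy(doc), query))
```

The point of the harness is that the same document set and query set are held fixed while one variable (chunking, embedding model, retrieval strategy) changes at a time.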
Convert natural language specifications into executable test suites. Integrates with Jest and Vitest. Extracts behavioral requirements from specs and generates property-based and example-based tests.
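The difference between the example-based and property-based tests such a tool emits can be illustrated concisely. The sketch below is in Python for brevity (the tool itself targets Jest and Vitest); the spec, function, and properties are all hypothetical.

```python
import random

# Hypothetical spec: "normalize_whitespace collapses runs of whitespace
# into single spaces and strips leading/trailing whitespace."
def normalize_whitespace(s):
    return " ".join(s.split())

# Example-based test: one concrete input/output pair drawn from the spec.
assert normalize_whitespace("  hello   world ") == "hello world"

# Property-based tests: behavioral requirements checked over random inputs.
for _ in range(200):
    s = "".join(random.choice(" ab\t") for _ in range(random.randint(0, 30)))
    out = normalize_whitespace(s)
    assert "  " not in out          # no double spaces survive
    assert out == out.strip()       # no leading/trailing whitespace
    # non-whitespace content is preserved
    assert out.replace(" ", "") == s.replace(" ", "").replace("\t", "")
```

Extracting the three property assertions from the prose spec is the hard part; once stated, they catch classes of bugs that a single hand-picked example never would.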
Annotated real-world business documents for training and evaluating automation models. 50K+ examples across invoices, purchase orders, contracts, and insurance forms. CC-BY licensed.
Trace, visualize, and debug complex agent pipelines. OpenTelemetry compatible. Captures tool call inputs and outputs, latency at each step, and token usage per agent. Works with any orchestration framework.
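The core idea, wrapping each tool call in a span that records inputs, output, and latency, can be sketched without the OpenTelemetry SDK. The decorator below is a hypothetical plain-Python analogue, not the tool's actual API; a real implementation would export these spans via OTLP rather than append to a list.

```python
import functools
import time

TRACE = []  # in-memory span store; a real exporter would ship these out

def traced(step_name):
    """Hypothetical decorator: record inputs, output, and latency
    for each tool call as one span."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            TRACE.append({
                "step": step_name,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": result,
                "latency_ms": (time.perf_counter() - start) * 1000,
            })
            return result
        return inner
    return wrap

@traced("summarize")
def summarize(text):
    return text[:20]

summarize("A long agent pipeline transcript...")
```

Because the instrumentation lives at the call boundary rather than inside any orchestrator, the same pattern works regardless of which framework schedules the agents.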
Every project has a CONTRIBUTING.md with setup instructions and contribution guidelines. We review pull requests from the community and typically respond within 3 business days.
Every project in this catalog started as internal tooling. We built agent-eval because we needed a standardized way to measure tool use reliability across client systems. We built prompt-guard because we were handling prompt injection incidents manually and it was not sustainable. We built flow-tracer because debugging production agent pipelines without observability was costing us hours per incident.
We release under permissive licenses (MIT, Apache 2.0, CC-BY 4.0) because we believe the tooling ecosystem for AI engineering is still early and fragmented. Keeping useful tools proprietary slows the whole field down. There is nothing in these repos that is a competitive differentiator for us — our advantage is in how we apply these tools, not in the tools themselves.
We maintain these repositories because we still use them. If we stop using a tool internally, we will archive the repository and say so clearly. We will not let projects rot silently.