
vibecheck — a YAML-first eval framework for any LLM

An agent-evaluation framework built around a simple YAML DSL. Compare models, save suites, mix string matching with semantic and LLM-judge checks, and run multi-model evals from the command line. The CLI is open source; the hosted service is in invite-only preview at vibescheck.io.

role
Solo build — design, engineering, release
date
October 6, 2025
stack
TypeScript · npm CLI · YAML DSL · Multi-provider LLM (OpenRouter) · Semantic + LLM-judge checks · Claude Code skill / MCP testing

the problem

Evals are how you stop debugging-by-vibe and start shipping with confidence, but the tooling forces a choice between heavyweight platforms (good, but a commitment) and ad-hoc scripts (fast, but disposable). Neither fits a solo builder or a small team that just needs to ask “did this prompt change make things better or worse, and on which models?”

what shipped

vibecheck — a CLI with a YAML DSL designed for the tightest possible iteration loop.

metadata:
  name: hello-world
  model: anthropic/claude-3.5-sonnet

evals:
  - prompt: Say hello
    checks:
      - match: "*hello*"
      - min_tokens: 1
      - max_tokens: 50

Run it with:

vibe check -f hello-world.yaml
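
For readers wondering how a glob check such as match: "*hello*" could be evaluated, here is a minimal TypeScript sketch. It is an illustration under stated assumptions, not vibecheck's implementation: the helper names globToRegExp and matchesGlob are hypothetical, and it assumes that * matches any run of characters and that matching ignores case.

```typescript
// Hypothetical helpers for illustration; not taken from the vibecheck source.
// Assumptions: `*` matches any run of characters, and matching is case-insensitive.
function globToRegExp(pattern: string): RegExp {
  const body = pattern
    .split("*")
    // Escape regex metacharacters so the literal parts of the glob match verbatim.
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  // `i` ignores case; `s` lets `.` span newlines in multi-line responses.
  return new RegExp(`^${body}$`, "is");
}

function matchesGlob(pattern: string, response: string): boolean {
  return globToRegExp(pattern).test(response);
}

console.log(matchesGlob("*hello*", "Hello, world!")); // true
```

Anchoring the pattern with ^ and $ means a glob without wildcards has to match the whole response, which is why the example wraps hello in * on both sides.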

Things that fell out of that DSL:

  • Multiple check kinds in the same suite. Glob match, semantic similarity, llm_judge quality assessments, token bounds. Cheap checks run first, so a failure short-circuits the expensive ones.
  • Multi-model comparison. -m "openai*,anthropic*" runs the same suite across providers. Results sort by price-performance.
  • Suites, variables, and secrets as first-class CLI objects. vibe set, vibe get, vibe var set, vibe secret set. Re-runs are trivial; secrets stay write-only.
  • Claude Code skill + MCP-tool eval support. Run evals from inside an agent loop, or eval an MCP tool the same way you’d eval a prompt.
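
The short-circuit behavior in the first bullet can be sketched roughly as follows. This is one possible reading, not the framework's actual code: the Check and CheckResult types, the numeric cost field, and the runChecks function are all hypothetical names introduced for illustration.

```typescript
// Hypothetical types; the names and cost ranks are illustrative, not vibecheck's API.
type CheckResult = { name: string; passed: boolean };

interface Check {
  name: string;
  // Relative cost: 0 for a local string check, higher for a network or model call.
  cost: number;
  run: (response: string) => Promise<boolean>;
}

// Runs checks cheapest-first and stops at the first failure,
// so expensive checks are never invoked for a response that already failed.
async function runChecks(checks: Check[], response: string): Promise<CheckResult[]> {
  // Copy before sorting so the caller's array order is left untouched.
  const ordered = [...checks].sort((a, b) => a.cost - b.cost);
  const results: CheckResult[] = [];
  for (const check of ordered) {
    const passed = await check.run(response);
    results.push({ name: check.name, passed });
    if (!passed) break; // short-circuit: skip the remaining, costlier checks
  }
  return results;
}
```

Under this sketch, a response that fails a glob match never reaches an LLM-judge call, which is where most of the latency and spend of a suite would go.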

what changed

The eval loop dropped from “set up a project” to “write five lines of YAML.” For my own work I now run a vibe check before merging anything that touches a prompt. The same suite I write for myself doubles as the regression check an agent can run autonomously.

the lesson

A good DSL beats a good UI when the user is sometimes a human, sometimes an agent. YAML is boring on purpose — it’s the most stable thing both audiences already know how to read and write.

Source: github.com/hev/vibecheck · API keys at vibescheck.io.
