vibecheck — a YAML-first eval framework for any LLM

the problem

Evals are how you stop debugging-by-vibe and start shipping with confidence — but the tooling forces a choice between heavyweight platforms (good, but a commitment) and ad-hoc scripts (fast, but disposable). For a solo builder or a small team that just needs to ask “did this prompt change make things better or worse, on which models?”, the gap is real.

what shipped

vibecheck — a CLI with a YAML DSL designed for the tightest possible iteration loop.

metadata:
  name: hello-world
  model: anthropic/claude-3.5-sonnet

evals:
  - prompt: Say hello
    checks:
      - match: "*hello*"
      - min_tokens: 1
      - max_tokens: 50

vibe check -f hello-world.yaml
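
To make the three checks concrete, here is a minimal Python sketch of what they assert about a model response. It is not vibecheck's implementation. The function name passes_hello_world, the case-insensitive glob, and the whitespace-based token count are all assumptions for illustration; a real tool would likely count tokens with a model tokenizer.

```python
from fnmatch import fnmatchcase


def passes_hello_world(response: str) -> bool:
    """Hypothetical stand-in for the three checks in hello-world.yaml."""
    # match: "*hello*" is treated as a glob; lowercasing the response
    # first is an assumption, not documented vibecheck behavior.
    glob_ok = fnmatchcase(response.lower(), "*hello*")
    # min_tokens / max_tokens are approximated by whitespace-separated words.
    token_count = len(response.split())
    return glob_ok and 1 <= token_count <= 50


if __name__ == "__main__":
    print(passes_hello_world("Hello! How can I help you today?"))
    print(passes_hello_world("Good morning."))
```

The first call satisfies all three checks. The second fails the glob, so the token bounds do not matter.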

Things that fell out of that DSL:

Multiple check kinds in the same suite. Glob match, semantic similarity, llm_judge quality assessments, token bounds. Cheap checks short-circuit before expensive ones.
Multi-model comparison. The flag -m "openai*,anthropic*" runs the same suite across providers, with results sorted by price-performance.
Suites, variables, and secrets as first-class CLI objects. vibe set, vibe get, vibe var set, vibe secret set. Re-runs are trivial; secrets stay write-only.
Claude Code skill + MCP-tool eval support. Run evals from inside an agent loop, or eval an MCP tool the same way you’d eval a prompt.
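
Two of the ideas above, glob-based model selection and cheap checks short-circuiting before expensive ones, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than vibecheck's actual code: the Check class, the cost field, the select_models and run_checks helpers, and the model identifiers other than anthropic/claude-3.5-sonnet are all hypothetical.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Callable


@dataclass
class Check:
    name: str
    cost: int  # hypothetical relative cost: higher means slower or pricier
    run: Callable[[str], bool]


def select_models(patterns: str, available: list[str]) -> list[str]:
    """Expand a comma-separated glob list such as "openai*,anthropic*"."""
    globs = [p.strip() for p in patterns.split(",") if p.strip()]
    return [m for m in available if any(fnmatchcase(m, g) for g in globs)]


def run_checks(response: str, checks: list[Check]) -> tuple[bool, list[str]]:
    """Run checks cheapest first and stop at the first failure."""
    executed: list[str] = []
    for check in sorted(checks, key=lambda c: c.cost):
        executed.append(check.name)
        if not check.run(response):
            return False, executed  # later, pricier checks never run
    return True, executed


if __name__ == "__main__":
    # Model identifiers here are made up for the example.
    models = ["openai/gpt-4o", "anthropic/claude-3.5-sonnet", "google/gemini-pro"]
    print(select_models("openai*,anthropic*", models))

    checks = [
        # Stand-in for an expensive judge call; a real one would hit a model API.
        Check("llm_judge", cost=100, run=lambda r: True),
        Check("match", cost=1, run=lambda r: fnmatchcase(r.lower(), "*hello*")),
        Check("max_tokens", cost=1, run=lambda r: len(r.split()) <= 50),
    ]
    print(run_checks("Good morning.", checks))
```

Sorting by cost means a failed glob ends the run before the judge is ever called, which is where the savings come from when a suite mixes free string checks with paid model calls.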

what changed

The eval loop dropped from “set up a project” to “write five lines of YAML.” For my own work I now run a vibe check before merging anything that touches a prompt. The same suite I write for myself doubles as the regression check an agent can run autonomously.

the lesson

A good DSL beats a good UI when the user is sometimes a human, sometimes an agent. YAML is boring on purpose — it’s the most stable thing both audiences already know how to read and write.

Source: github.com/hev/vibecheck · API keys at vibescheck.io.