Now in beta

Test your prompts before they break production

Version your prompts, define test cases, and run A/B evals across OpenAI and Anthropic. Catch regressions before they reach your users.

Also available via npm: npm install @phasio/sdk

Works in your CI/CD pipeline

Install the SDK and run evals in code — no GUI required.

eval.ts
import { Phasio, contains, matches, llmJudge } from '@phasio/sdk';

const pe = new Phasio({
  apiKey: 'pe-xxxx',
  providers: [
    { provider: 'openai', llmKey: 'sk-...', model: 'gpt-4o-mini' },
    { provider: 'anthropic', llmKey: 'sk-ant-...', model: 'claude-haiku-4-5-20251001' },
  ],
});

const result = await pe.compare({
  versions: [
    { label: 'v1', template: 'Summarize: {{input}}' },
    { label: 'v2', template: 'Brief summary of: {{input}}' },
  ],
  tests: [
    { input: 'The quick brown fox...', expect: contains('fox') },
    { input: 'What is 2+2?', expect: matches(/\b4\b/) },
    { input: 'Write a haiku', expect: llmJudge('Valid 5-7-5 haiku') },
  ],
});
terminal
Phasio
────────────────────────────────────────
2 providers · 2 versions · 3 tests

openai (gpt-4o-mini)
────────────────────────────────────────
         v1             v2
case 1   ✓ 821ms        ✓ 743ms
case 2   ✓ 654ms        ✓ 612ms
case 3   ✓ 1.2s         ✓ 980ms

score    100%           100%
latency  891ms avg      778ms avg

= Tie on accuracy — v2 faster (778ms avg)

Everything you need to ship prompts with confidence

Prompt Versioning

Track every change to your prompts. Compare v1 vs v2 side by side with full history.

Eval Runner

Run rule-based and LLM-judge evals against your test suites. Get pass/fail per case.

Diff Reports

See exactly which test cases regressed between versions before you deploy.

LLM Judge

Define natural language criteria. Let GPT or Claude score your outputs automatically.

Multi-Provider

Run the same suite against OpenAI and Anthropic in parallel. Compare providers side by side.

Analytics

Track score trends over time. Get regression alerts when a prompt drops in performance.

How it works

01

Create a prompt and version it

Write your prompt template using {{input}} as the variable. Every change creates a new version automatically.
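A {{input}} template is just string substitution. Here's a minimal sketch of how that rendering might work (the render helper is illustrative, not part of @phasio/sdk):

```typescript
// Illustrative helper: substitute {{input}} (or any {{name}}) into a
// prompt template. Unknown variables render as empty strings.
function render(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_, name) => vars[name] ?? '');
}

const v1 = 'Summarize: {{input}}';
console.log(render(v1, { input: 'The quick brown fox...' }));
// Summarize: The quick brown fox...
```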

02

Define your test cases

Add inputs and expected behaviors. Use contains, matches (regex), and not_contains checks, or an LLM judge for nuanced scoring.
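Under the hood, rule-based checks like these reduce to simple predicates over the model's output. A sketch of how contains, matches, and not_contains could work (illustrative only; the SDK's actual matchers may differ, and notContains is our naming here):

```typescript
// Each check is a predicate over the model's output string.
type Check = (output: string) => boolean;

const contains = (s: string): Check => (output) => output.includes(s);
const notContains = (s: string): Check => (output) => !output.includes(s);
const matches = (re: RegExp): Check => (output) => re.test(output);

// Applying the checks to a sample output:
const output = 'The answer is 4.';
console.log(contains('answer')(output)); // true
console.log(matches(/\b4\b/)(output));   // true
console.log(notContains('5')(output));   // true
```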

03

Run an eval — web or SDK

Select two versions and a test suite on the dashboard, or install the SDK (npm install @phasio/sdk) and run evals from your CI pipeline.

04

Ship with confidence

See exactly what improved and what regressed. Deploy only when your score goes up.
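Conceptually, a regression diff compares per-case pass/fail between two versions and flags cases that flipped from pass to fail. A minimal sketch (the data shapes here are assumptions, not the SDK's actual report format):

```typescript
// Assumed per-case result shape for one prompt version.
type CaseResult = { id: string; passed: boolean };

// Cases that passed in `before` but fail in `after` are regressions.
function regressions(before: CaseResult[], after: CaseResult[]): string[] {
  const prev = new Map(before.map((r) => [r.id, r.passed]));
  return after
    .filter((r) => prev.get(r.id) === true && !r.passed)
    .map((r) => r.id);
}

const v1Results = [
  { id: 'case 1', passed: true },
  { id: 'case 2', passed: true },
];
const v2Results = [
  { id: 'case 1', passed: true },
  { id: 'case 2', passed: false },
];
console.log(regressions(v1Results, v2Results)); // [ 'case 2' ]
```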

Ready to stop guessing?

Start testing your prompts in minutes. No credit card required.

Get started for free →

npm install @phasio/sdk