Version your prompts, define test cases, and run A/B evals across OpenAI and Anthropic. Catch regressions before they reach your users.
Also available as npm install @phasio/sdk
Install the SDK and run evals in code — no GUI required.
import { Phasio, contains, matches, llmJudge } from '@phasio/sdk';
const pe = new Phasio({
apiKey: 'pe-xxxx',
providers: [
{ provider: 'openai', llmKey: 'sk-...', model: 'gpt-4o-mini' },
{ provider: 'anthropic', llmKey: 'sk-ant-...', model: 'claude-haiku-4-5-20251001' },
],
});
const result = await pe.compare({
versions: [
{ label: 'v1', template: 'Summarize: {{input}}' },
{ label: 'v2', template: 'Brief summary of: {{input}}' },
],
tests: [
{ input: 'The quick brown fox...', expect: contains('fox') },
{ input: 'What is 2+2?', expect: matches(/\b4\b/) },
{ input: 'Write a haiku', expect: llmJudge('Valid 5-7-5 haiku') },
],
});Phasio
────────────────────────────────────────
1 provider · 2 versions · 3 tests
openai (gpt-4o-mini)
────────────────────────────────────────
v1 v2
case 1 ✓ 821ms ✓ 743ms
case 2 ✓ 654ms ✓ 612ms
case 3 ✓ 1.2s ✓ 980ms
score 100% 100%
latency 891ms avg 778ms avg
= Tie on accuracy — v2 faster (778ms avg)
Track every change to your prompts. Compare v1 vs v2 side by side with full history.
Run rule-based and LLM-judge evals against your test suites. Get pass/fail per case.
See exactly which test cases regressed between versions before you deploy.
Define natural language criteria. Let GPT or Claude score your outputs automatically.
Run the same suite against OpenAI and Anthropic in parallel. Compare providers side by side.
Track score trends over time. Get regression alerts when a prompt drops in performance.
Write your prompt template using {{input}} as the variable. Every change creates a new version automatically.
Add inputs and expected behaviors. Use contains, regex, not_contains checks or LLM judge for nuanced scoring.
Select two versions and a test suite on the dashboard, or run via npm install @phasio/sdk in your CI pipeline.
See exactly what improved and what regressed. Deploy only when your score goes up.
Start testing your prompts in minutes. No credit card required.
npm install @phasio/sdk