Prompt Testing

Unified testing for Insights, VoiceBot, and RAG prompts

Advanced Mode

Playground

A/B Compare

Batch Testing

Transcriber Eval

Version History

Prompt

Inject RAG context

Input

Output

{
  "sentiment": "positive",
  "confidence": 0.92,
  "key_phrases": [
    "I appreciate that",
    "thanks so much",
    "quick help",
    "process a refund immediately"
  ]
}

Run your prompt against an entire dataset to measure accuracy and performance. Select a prompt, model, and dataset, then click Run Batch.

Execution Configuration

Prompt

Model

GPT-4o: most accurate. GPT-4o-mini: faster, lower cost. Custom: your own endpoint.

Dataset

Execution Progress

Running

193 / 248 items processed 78%

Match Rate

91.2%

Percentage of outputs matching expected reference output

Avg Latency

1.4s

Average API response time per test item

Errors

Items that failed to produce a valid response

Est. Remaining

1m 12s

Estimated time to finish processing remaining items

Execution History

Run ID	Prompt	Model	Dataset	Match Rate	Status	Date
EX-047	Sentiment v2.3	GPT-4o	Sales Q4	91.2%	Running	Feb 24
EX-046	Topic v1.8	GPT-4o-mini	Support	87.4%	Completed	Feb 24
EX-045	Sentiment v2.3	Claude 3.5	Sales Q4	82.1%	Completed	Feb 23
EX-044	Summary v3.1	GPT-4o	VoiceBot Q1	68.9%	Failed	Feb 23

Execution EX-047

Sentiment Analysis v2.3 · GPT-4o · 248 items

Match Rate

87.1%

Avg Latency

1.2s

Total Cost

$0.47

Errors

Processing Pipeline

Load Dataset

0.3s

Run Prompts

4m 12s

Parse Outputs

1.8s

Evaluate

0.9s

Complete

Results Sample

Input	Expected	Actual	Confidence
Customer complained about billing error...	negative	negative	94%
Agent resolved the issue quickly and...	positive	positive	91%
The caller asked about available plans...	neutral	positive	62%
Customer was very upset about wait time...	negative	negative	97%
Agent offered a discount as compensation...	positive	neutral	54%
Standard greeting and identity verification...	neutral	neutral	88%

Error Breakdown

Parse Errors: 2 Timeout: 1 API Errors: 0

Parse Item #47: Response was not valid JSON — truncated at 4096 tokens

Parse Item #183: Missing required field "confidence" in response

Timeout Item #201: API request timed out after 30s

Compare transcription model performance side-by-side. Select a base model and a fine-tuned model, run them against the same test dataset, and analyze which domain-specific terms improved.

Evaluation Configuration

Model A

Model B

Test Dataset

Model A WER

18.4%

Model B WER

11.2%

Improvement

↓ 7.2%

Test Samples

245

Side-by-Side Comparison

#	Reference Text	Model A Output	Model B Output	A WER%	B WER%
1	The SLA compliance rate improved after implementing VoIP monitoring	The essay lay compliance rate improved after implementing VoIP monitoring	The SLA compliance rate improved after implementing VoIP monitoring	15%	0%
2	Please verify DTMF input before routing to IVR	Please verify DTM F input before routing to I V R	Please verify DTMF input before routing to IVR	22%	0%
3	The ARPU metric shows quarterly growth trends	The are poo metric shows quarterly growth trends	The ARPU metric shows quarterly growth trends	18%	0%
4	Configure the SIP trunk for redundant failover	Configure the sip trunk for redundant failover	Configure the SIP trunk for redundant failover	8%	0%
5	The customer's billing cycle resets on the fifteenth	The customer's billing cycle resets on the fifteenth	The customer's billing cycle resets on the fifteenth	0%	0%
6	TCPA regulations require opt-in consent for outbound calls	TCP a regulations require opt in consent for outbound calls	TCPA regulations require opt-in consent for outbound calls	12%	0%
7	Schedule a callback using the interactive menu system	Schedule a callback using the interactive menu system	Schedule a call back using the interactive menu system	0%	8%
8	The RMA process requires a valid warranty verification	The R M A process requires a valid warranty verification	The RMA process requires a valid warranty verification	10%	0%

Error Analysis

Most Improved Terms

SLA (was: “essay lay”) Fixed

DTMF (was: “DTM F”) Fixed

ARPU (was: “are poo”) Fixed

SIP (was: “sip”) Fixed

TCPA (was: “TCP a”) Fixed

IVR (was: “I V R”) Fixed

Remaining Errors

“callback” → “call back” Regression

Compound words with hyphens Known Issue