Prompt Testing
Unified testing for Insights, VoiceBot, and RAG prompts
Advanced Mode
Playground
A/B Compare
Batch Testing
Transcriber Eval
Version History
Prompt
Input
Output
{
"sentiment": "positive",
"confidence": 0.92,
"key_phrases": [
"I appreciate that",
"thanks so much",
"quick help",
"process a refund immediately"
]
}
A/B Compare coming soon
Run your prompt against an entire dataset to measure accuracy and performance. Select a prompt, model, and dataset, then click Run Batch.
Execution Configuration
GPT-4o: most accurate. GPT-4o-mini: faster, lower cost. Custom: your own endpoint.
Execution Progress
Running
193 / 248 items processed
78%
Match Rate
91.2%
Percentage of outputs that match the expected reference output
Avg Latency
1.4s
Average API response time per test item
Errors
3
Items that failed to produce a valid response
Est. Remaining
1m 12s
Estimated time to finish processing remaining items
Execution History
| Run ID | Prompt | Model | Dataset | Match Rate | Status | Date |
|---|---|---|---|---|---|---|
| EX-047 | Sentiment v2.3 | GPT-4o | Sales Q4 | 91.2% | Running | Feb 24 |
| EX-046 | Topic v1.8 | GPT-4o-mini | Support | 87.4% | Completed | Feb 24 |
| EX-045 | Sentiment v2.3 | Claude 3.5 | Sales Q4 | 82.1% | Completed | Feb 23 |
| EX-044 | Summary v3.1 | GPT-4o | VoiceBot Q1 | 68.9% | Failed | Feb 23 |
Execution EX-047
Sentiment Analysis v2.3 · GPT-4o · 248 items
Match Rate
87.1%
Avg Latency
1.2s
Total Cost
$0.47
Errors
3
Processing Pipeline
Load Dataset
0.3s
Run Prompts
4m 12s
Parse Outputs
1.8s
Evaluate
0.9s
Complete
Results Sample
| Input | Expected | Actual | Match | Confidence |
|---|---|---|---|---|
| Customer complained about billing error... | negative | negative | Yes | 94% |
| Agent resolved the issue quickly and... | positive | positive | Yes | 91% |
| The caller asked about available plans... | neutral | positive | No | 62% |
| Customer was very upset about wait time... | negative | negative | Yes | 97% |
| Agent offered a discount as compensation... | positive | neutral | No | 54% |
| Standard greeting and identity verification... | neutral | neutral | Yes | 88% |
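As a rough illustration of the Match Rate metric, the sketch below computes it for the six sample rows above, assuming a match means the Expected and Actual labels are exactly equal. The function name `match_rate` and the `rows` list are hypothetical and not part of the product. The result covers only this sample, not the full 248-item run.

```python
# (expected, actual) label pairs copied from the Results Sample table.
rows = [
    ("negative", "negative"),
    ("positive", "positive"),
    ("neutral", "positive"),
    ("negative", "negative"),
    ("positive", "neutral"),
    ("neutral", "neutral"),
]


def match_rate(pairs):
    """Return the percentage of pairs whose expected and actual labels are equal.

    Assumption: exact string equality defines a match.
    """
    if not pairs:
        return 0.0
    matches = sum(1 for expected, actual in pairs if expected == actual)
    return 100.0 * matches / len(pairs)


print(f"{match_rate(rows):.1f}%")  # prints 66.7% (4 of 6 sample rows match)
```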
Error Breakdown
Parse Errors: 2
Timeout: 1
API Errors: 0
Parse · Item #47: Response was not valid JSON (truncated at 4096 tokens)
Parse · Item #183: Missing required field "confidence" in response
Timeout · Item #201: API request timed out after 30s
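The two parse errors above (invalid JSON, missing required field) could be detected with a check along these lines. This is a minimal sketch, not the tool's actual parser: `classify_response` and `REQUIRED_FIELDS` are hypothetical names, and the list of required fields is assumed from the sample output shown in the Playground.

```python
import json

# Assumption: these three fields, taken from the sample output, are all required.
REQUIRED_FIELDS = ("sentiment", "confidence", "key_phrases")


def classify_response(raw: str) -> str:
    """Return 'ok' or a short parse-error label for one raw model response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Covers truncated responses, e.g. output cut off at a token limit.
        return "parse_error: invalid JSON"
    if not isinstance(data, dict):
        return "parse_error: not a JSON object"
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        return f"parse_error: missing required field {missing[0]!r}"
    return "ok"


print(classify_response('{"sentiment": "positive", "key_phrases": []}'))
```

Timeouts and API errors occur before any response text exists, so they would need to be caught around the request itself rather than in a parser like this one.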
Compare transcription model performance side-by-side. Select a base model and a fine-tuned model, run them against the same test dataset, and analyze which domain-specific terms improved.
Evaluation Configuration
Model A WER
18.4%
Model B WER
11.2%
Improvement
↓ 7.2 pts (absolute)
Test Samples
245
Side-by-Side Comparison
| # | Reference Text | Model A Output | Model B Output | A WER% | B WER% |
|---|---|---|---|---|---|
| 1 | The SLA compliance rate improved after implementing VoIP monitoring | The essay lay compliance rate improved after implementing VoIP monitoring | The SLA compliance rate improved after implementing VoIP monitoring | 15% | 0% |
| 2 | Please verify DTMF input before routing to IVR | Please verify DTM F input before routing to I V R | Please verify DTMF input before routing to IVR | 22% | 0% |
| 3 | The ARPU metric shows quarterly growth trends | The are poo metric shows quarterly growth trends | The ARPU metric shows quarterly growth trends | 18% | 0% |
| 4 | Configure the SIP trunk for redundant failover | Configure the sip trunk for redundant failover | Configure the SIP trunk for redundant failover | 8% | 0% |
| 5 | The customer's billing cycle resets on the fifteenth | The customer's billing cycle resets on the fifteenth | The customer's billing cycle resets on the fifteenth | 0% | 0% |
| 6 | TCPA regulations require opt-in consent for outbound calls | TCP a regulations require opt in consent for outbound calls | TCPA regulations require opt-in consent for outbound calls | 12% | 0% |
| 7 | Schedule a callback using the interactive menu system | Schedule a callback using the interactive menu system | Schedule a call back using the interactive menu system | 0% | 8% |
| 8 | The RMA process requires a valid warranty verification | The R M A process requires a valid warranty verification | The RMA process requires a valid warranty verification | 10% | 0% |
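For reference, word error rate (WER) is conventionally the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The sketch below implements that textbook definition with a case-sensitive, whitespace-tokenized comparison. The evaluator's own normalization and scoring are not documented here, so its per-row percentages need not agree with this function. The name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count.

    Assumptions: whitespace tokenization, case-sensitive matching, no
    punctuation or hyphen normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "Schedule a callback using the interactive menu system"
hypothesis = "Schedule a call back using the interactive menu system"
print(f"{word_error_rate(reference, hypothesis):.2%}")
```

Under this definition, splitting "callback" into "call back" counts as one substitution plus one insertion, which is why compound-word regressions can weigh heavily on short utterances.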
Error Analysis
Most Improved Terms
SLA (was: “essay lay”) Fixed
DTMF (was: “DTM F”) Fixed
ARPU (was: “are poo”) Fixed
SIP (was: “sip”) Fixed
TCPA (was: “TCP a”) Fixed
IVR (was: “I V R”) Fixed
Remaining Errors
“callback” → “call back” Regression
Compound words with hyphens Known Issue
Version History coming soon