Version Control Freak
TGIF! Thank god it’s features, here’s what we shipped this week:
Datasets just got serious with Dataset Versioning—every change tracked, every version referenceable, no more “which dataset did we eval against?” Meanwhile, Replay Trace in Arena lets you re-run any production trace through Arena to compare models side-by-side on real traffic. And for the compliance-minded: Audit Logs are here.

Added
- Dataset Versioning - Datasets now have full version history. Every edit, every addition tracked—so you always know exactly what you evaluated against. No more version of events that doesn’t add up.
- Replay Trace in Arena - Take any production trace and replay it in Arena. Compare how different models handle the same real-world input, side by side. It’s the replay value you’ve been waiting for.
- Audit Logs - Full visibility into who did what, and when. Every action logged, every change accounted for. Your compliance team just breathed a sigh of relief.
MC…What?!!
TGIF! Thank god it’s features, here’s what we shipped this week:
Headline first: Confident AI now has an MCP server (open-sourced on GitHub), so you can plug your evals, datasets, and traces into any MCP-compatible client. Also shipping this week: automatic dataset curation from production traces, a wave of Observatory upgrades (custom column variable mapping, annotation tabs, category filters, metric columns), and PagerDuty for alerts.

Added
- MCP Server - Plug Confident AI into any MCP-compatible client. Your evals, datasets, and traces, accessible from wherever you already work. The Model Context Protocol is served.
- Automatic Dataset Curation from Traces & Spans - The big one. Turn production traces and spans into curated datasets automatically. Your best (and worst) real-world examples, ready for eval—no manual curation required. Let your data curate itself.
- Annotation Tabs in Observatory - Annotations now live in their own tabs, so you can flip between views without losing context. We’re keeping tabs on your feedback.
- PagerDuty Integration for Alerts - Route alerts straight to PagerDuty so the right people get paged at the right time. On-call never looked so connected.
- Custom Column Variable Mapping - Map variables directly to custom columns in Observatory. Your data, your layout—no more squinting at mismatched fields. Finally, everything maps out.
- Category Filters & Metric/Annotation Column Options - Filter by category and toggle metric or annotation columns on and off. Observatory now lets you see exactly what matters—no more, no less. Filter out the noise.
Changed
- General Stability & Performance Improvements - Faster loads, fewer hiccups, smoother everything. The kind of changes you feel more than you see.
Prompt-ly Evaluated
TGIF! Thank god it’s features, here’s what we shipped this week:
Headline first: Prompt Evals are here. Think GitHub Actions, but for prompt commits and version releases—so every prompt change can trigger the checks that keep quality high and surprises low.

Added
- Prompt Evals - The big one. Run evals on prompt commits and version releases automatically—CI for prompts, not vibes-based QA.
- Support Ticket Submission Page - Need help? There’s now a dedicated place to ask for it.
- Trace Classification - Sort your traces into categories. Less chaos, more class.
- Dataset Threads + Scenario Generation - Datasets now support threads, with scenario generation to spin up richer test cases.
- Portkey Support in Arena and Experiments - Portkey now works in Arena and Experiments. The key to connected workflows.
- SSE + HTTP Streaming for AI Connections - Stream responses over SSE or HTTP. Go with the flow.
- Azure Key Vault Integration - Store secrets in Azure Key Vault. Your keys, under lock and cloud.
- Org Settings Pages - New pages for roles, permissions, and API keys. Access control, finally under control.
Changed
- Evaluate Buttons on Traces and Spans - Trigger evals directly from where issues appear, so troubleshooting is fewer clicks and more signal.
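For anyone curious what "streaming over SSE" looks like on the wire, here is a minimal sketch of reading Server-Sent Events line by line. It only illustrates the standard SSE text format; the function name `iter_sse_data` and the sample stream are made up for this example, and nothing here describes how Confident AI consumes streams internally.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event from an iterable of text lines.

    Hypothetical helper for illustration; not part of any Confident AI SDK.
    """
    data_parts: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield "\n".join(data_parts)
                data_parts = []
            continue
        if line.startswith(":"):
            # Lines starting with a colon are comments (often keep-alives).
            continue
        field, _, value = line.partition(":")
        if field == "data":
            # One leading space after the colon is not part of the value.
            data_parts.append(value[1:] if value.startswith(" ") else value)
    # An event that is never closed by a blank line is dropped.


# Made-up stream: two events, with a keep-alive comment in between.
stream = ["data: Hel\n", "\n", ": ping\n", "data: lo\n", "data: world\n", "\n"]
chunks = list(iter_sse_data(stream))
```

Multiple `data:` lines inside one event are joined with newlines, which is why the second event above comes out as a single two-line chunk.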
We Need to Talk. In Code.
TGIF! Thank god it’s features, here’s what we shipped this week:
Big week for the org-anized among us. Multi-turn evals go code-first, Vercel joins the family, and prompts finally get the observability they deserve.

Added
- Code-Based Multi-Turn Evals - Introducing `ConversationalTestCase` for your codebase. All the power of multi-turn evaluation, now programmable. Time to have the talk with your chatbot, in code.
- Vercel AI SDK Integration - Next.js devs, rejoice! Native integration with Vercel’s AI SDK means you can trace and evaluate your `ai` package calls with zero friction. Ship fast, eval faster.
- Transformers on Retrievers & Tools - Transformers aren’t just for AI connection outputs anymore. Reshape retriever outputs and tool calls before evaluation. Your agentic RAG pipeline called; it wants its custom parsing back.
- Organization-Wide Metrics - Define metrics at the org level and share them across all your teams. No more “wait, which faithfulness config are we using?” Standardize once, evaluate everywhere.
Changed
- Prompt Observability - Track which prompts are running in production, when they were swapped, and how performance changed. Finally, prompt feedback on your prompts.
More Than Meets the AI
TGIF! Thank god it’s features, here’s what we shipped this week:
Transformers (Beta) are here and they’re truly more than meets the AI. Reshape your traced data before evaluation—because not every trace deserves the full spotlight. Meanwhile, Prompt Studio just got a serious commit-ment upgrade with git-style versioning. Love is in the diff this Valentine’s weekend.

Added
- Transformers (Beta) - The biggest release this week, and it’s more than meets the eye. Write custom code to transform your traced data—including individual spans—before evaluation. Don’t want the whole trace? No problem. Cherry-pick exactly what matters.
- Transformers on AI Connections - Got a JSON blob coming back from your model? Negative indexes on a list? Transformers let you parse and wrangle AI connection outputs however you need. Your data, your rules.
- Prompt Commits - Every change to your prompt now creates a commit. Full history, no more guessing what changed or when. It’s `git log` for your prompts, and it’s beautiful.
Changed
- Git-Based Prompt Studio - Prompt Studio is leaning hard into the git workflow. Commits, versions, diffs—everything you love about version control, now for your prompts. We’re committing to this direction. (Pun intended.)
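To make the "JSON blob" and "negative indexes" case from the Transformers items concrete, here is a rough sketch of the kind of custom code a transformer might contain. The function name `transform_output`, its signature, and the response shape are all assumptions for illustration; the actual transformer interface isn't specified in this changelog.

```python
import json
from typing import Any


def transform_output(raw_output: str) -> str:
    """Hypothetical transformer: return the text of the last choice in a JSON response."""
    payload: dict[str, Any] = json.loads(raw_output)
    choices = payload["choices"]
    # A negative index picks the final element without knowing the list length.
    return choices[-1]["text"]


# Made-up model response with two choices.
raw = json.dumps({"choices": [{"text": "draft"}, {"text": "final answer"}]})
result = transform_output(raw)
```

Only the last choice's text would reach the evaluation step in this sketch; everything else in the blob is left behind.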
Let There Be Light (Mode)
TGIF! Thank god it’s features, here’s what we shipped this week:
Big week for visibility—both in your data and on your screen. We’re launching 30+ additional Observatory graphs to surface insights, a Data Usage settings page for full transparency, and light mode is officially out of beta. Shine bright, friends.

Added
- Data Usage Settings Page - Know thy data. A dedicated page to see exactly how your data is being used—because transparency isn’t just a buzzword, it’s a lifestyle.
- Observatory Graphs - Finally, charts that slap. Visualize your observability data, spot trends before they spot you, and look like a genius in your next standup.
- Code Evals (Beta) - G-Eval couldn’t cut it? Write your own eval logic in code. We don’t judge. Okay, technically we do—that’s the whole point.
- Multimodal Arena - Let your vision-language models duke it out. Two models enter, one model leaves with bragging rights.
- AI Connection Upgrades - Tracing, list indexes in key paths, duplicate connections, max concurrency: the works. Your AI connections just got a glow-up.
Changed
- Light Mode Out of Beta - Light mode is officially here to stay. Welcome to the bright side.
- Faster Observatory Dashboards - We gave our dashboards a double espresso. Load times are now unreasonably fast.
Scaling New Heights
TGIF! Thank god it’s features, here’s what we shipped this week:
Welcome to our brand new changelog! We’re kicking things off with better cost tracking, reliability improvements, and some serious scalability upgrades.

Changelogs before this point are backfilled!
Added
- Changelog - You’re reading it! Subscribe to never miss a beat.
- Custom Model Costs - Set custom cost-per-token for any model in your project settings. Finally, accurate cost tracking for fine-tuned and self-hosted models.
- Request Timeout for AI Connections - Configure timeout limits for your LLM connections. No more hanging requests.
- High-Volume Trace Ingestion - We’ve beefed up our trace handling with buffered ingestion. Traffic spikes? Bring ’em on.
Changed
- Smoother Experiment Runs - Real-time evaluation progress is now more reliable with improved streaming.
- Annotator Attribution - See who left that annotation. Credit where credit’s due.
- Faster Spans Loading - The spans tab now loads at lightning speed, even for trace-heavy projects.
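If you're wondering how a custom cost-per-token turns into a dollar figure, the arithmetic behind Custom Model Costs can be sketched as below. The `ModelCost` class, the split into input and output rates, and the rates themselves are illustrative assumptions, not the product's actual settings schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCost:
    """Hypothetical per-token rates, in dollars."""
    input_cost_per_token: float
    output_cost_per_token: float


def trace_cost(cost: ModelCost, input_tokens: int, output_tokens: int) -> float:
    """Total cost of one call: tokens multiplied by their per-token rate."""
    return (
        input_tokens * cost.input_cost_per_token
        + output_tokens * cost.output_cost_per_token
    )


# Made-up rates for a self-hosted model: $0.50 / $1.50 per million tokens.
custom = ModelCost(input_cost_per_token=0.5e-6, output_cost_per_token=1.5e-6)
total = trace_cost(custom, input_tokens=1200, output_tokens=300)
```

With these example rates, 1,200 input tokens and 300 output tokens come to roughly $0.00105 for the call.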
Alert the Press, We’re Going Multimodal
TGIF! Thank god it’s features, here’s what we shipped this week:
Big week! We’re introducing alerts to keep you in the loop, shareable traces for collaboration, and multimodal support so your vision models don’t feel left out.

Added
- Public Trace Links - Share traces with anyone via a public link. Perfect for debugging with teammates or showing off to stakeholders.
- Scheduled Alerts - Set thresholds, get notified. Never let a regression slip through unnoticed again.
- Multimodal Evaluations - Images + text? We can evaluate that now. Test your vision-language models with confidence.
- Evaluation Queue - Large eval jobs now queue up nicely instead of timing out. Go big or go home.
Changed
- Snappier Dashboards - Graphs load faster. Like, noticeably faster. You’re welcome.
On Cloud Nine
TGIF! Thank god it’s features, here’s what we shipped this week:
Azure fans, GCP enthusiasts—we see you. This week we’re bringing the clouds to Confident AI so you can evaluate using your own infrastructure.

Added
- Azure OpenAI Support - Connect your Azure deployment and run evals without leaving your cloud comfort zone.
- GCP Vertex AI Integration - Drop in your service account key and you’re off to the races with Google’s models.
- Top-K Filtering - Show me the top 10. Or bottom 5. Or whatever K your heart desires.
Changed
- Faster Dashboards - We optimized the heck out of our aggregation layer. Graphs now load before you finish your sip of coffee.
- Live Evaluation Progress - Watch your evals run in real-time with streaming progress updates. It’s oddly satisfying.
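For the curious, Top-K filtering boils down to "keep the K highest (or lowest) rows by some value." A minimal sketch, assuming rows are plain dictionaries; the `top_k` helper, the field names, and the sample rows are hypothetical and not taken from the product.

```python
import heapq
from typing import Any


def top_k(rows: list[dict[str, Any]], key: str, k: int, bottom: bool = False) -> list[dict[str, Any]]:
    """Return the k rows with the largest (or, if bottom=True, smallest) value for `key`."""
    pick = heapq.nsmallest if bottom else heapq.nlargest
    return pick(k, rows, key=lambda row: row[key])


# Made-up rows for illustration.
rows = [
    {"name": "a", "score": 0.91},
    {"name": "b", "score": 0.42},
    {"name": "c", "score": 0.77},
    {"name": "d", "score": 0.63},
]
best_two = [row["name"] for row in top_k(rows, "score", 2)]
worst_one = [row["name"] for row in top_k(rows, "score", 1, bottom=True)]
```

Using `heapq` avoids sorting the whole list when K is small relative to the number of rows.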
Dashing Into the New Year
TGIF! Thank god it’s features, here’s what we shipped this week:
New year, new dashboards! We’ve redesigned how you visualize your LLM performance with customizable views and smarter breakdowns. And while you’re at it, you can now take your security insights with you as PDF reports.

Added
- Custom Dashboards - Build your own views. Save them. Make them yours. Finally, analytics that fit how you work.
- Dimension Breakdowns - Slice and dice by model, environment, or any dimension. Compare apples to apples (or GPT-4 to Claude).
- Risk Assessment Reports (PDF) - Generate custom risk assessment reports from your red teaming runs and download them as shareable PDFs. Perfect for reviews, audits, and internal security discussions. (Keep it confidential)
Changed
- Fresh Dashboard Layout - Everything’s been reorganized for better flow. Less clicking, more insights.
- Readable Timestamps - Dates and times now look like actual dates and times. Revolutionary, we know.