You Shall Not Merge!!!
TGIF! Thank god it’s features, here’s what we shipped this week:
The one you’ve been holding your breath for: Prompt Pull Requests & Approval Workflows are finally live—raise a PR on your prompt branch, let reviewers inspect diffs and eval results before signing off, and get a full audit trail of every change. AI Connections also got a major upgrade: a Postman-style layout, Auth0 and HMAC authorization, and direct trace linking to individual turns in multi-turn test runs. Plus: Thread Categorization with a configurable sample rate, and red teaming progress bars with more progress.

Added
- Prompt Pull Requests & Approval Workflows - Raise a PR on any prompt branch. Reviewers see diffs and eval results side by side before approving, and every merge leaves a full audit trail. Prompt engineering, meet version-control discipline. Approved.
- AI Connection Authorization - AI Connections now support Auth0 SSO and HMAC signing. Secure your connections without the overhead. Consider it auth-orized.
- Trace Linking to Turns in Multi-Turn Test Runs - AI Connections now link traces directly to individual turns within multi-turn test runs. Full visibility at every step of the conversation. The turn you’ve been waiting for.
- Thread Categorization - Automatically categorize your threads to understand what your users are actually talking about. Set a sample rate to control how much traffic gets categorized. Categorically useful.
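A sample rate like the one in Thread Categorization typically acts as a probabilistic gate on incoming traffic. Here's a minimal sketch of that idea in Python; the `sample_rate` semantics below are an illustrative assumption, not Confident AI's implementation:

```python
import random

def should_categorize(sample_rate: float, rng: random.Random) -> bool:
    """Return True for roughly `sample_rate` fraction of threads.

    sample_rate: fraction of traffic to categorize, between 0.0 and 1.0.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    return rng.random() < sample_rate

# With a 25% sample rate, roughly a quarter of threads get categorized.
rng = random.Random(42)  # seeded for reproducibility
sampled = sum(should_categorize(0.25, rng) for _ in range(10_000))
ratio = sampled / 10_000  # close to 0.25
```

The point of sampling is cost control: categorization runs an LLM call per thread, so on high-traffic projects you rarely want 100%.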
Changed
- New AI Connection Layout - AI Connections get a Postman-inspired makeover: clean, familiar, and built for how you already think about API calls. Connect in style.
- Improved Red Teaming Progress Bars - Progress bars for red teaming jobs got a polish pass—more granular, more informative, no more guessing how far along you are. Watch every step of your risk assessment unfold. Progress has definitely been made.
Branching Out
TGIF! Thank god it’s features, here’s what we shipped this week:
Buckle up—this is a big one. Prompt Branches bring proper version-control workflows to your prompts: branch, iterate, and merge without touching production. Custom Dashboards let you build your own Observatory views from scratch. Plus: OpenRouter and TrueFoundry are now available in Arena and Experiments, OpenInference tracing lands for Python and TypeScript, and enterprise auth gets a serious upgrade with HMAC & Auth0 support.

Added
- Prompt Branches - Branch off your prompts, iterate safely, and merge back when you’re ready. Your prompt engineering, with the same version-control discipline as your code. A real branch upgrade.
- Custom Dashboards - Build your own Observatory dashboards from scratch. Pick your metrics, arrange your panels, tell your data’s story. Your observatory, your *dash*board.
- OpenRouter & TrueFoundry in Arena & Experiments - Two new model providers, one week. Access hundreds of models through OpenRouter or bring your fine-tuned TrueFoundry models—all available in Arena and Experiments. The route to more models just got shorter.
- OpenInference Integration - Trace your LLM apps with OpenInference in both Python and TypeScript. Plug in, light up, see everything. Openly invited.
- HMAC & Auth0 Support - Enterprise-grade authentication with HMAC signing and Auth0 SSO. Security that doesn’t slow you down. Consider this auth-orized.
- New Thread Displayer - Threads get a brand-new visual treatment—cleaner, faster, and easier to follow multi-turn conversations. Threads have never been so well-threaded.
- AI Connections for Quick Runs & Experiments - Connect your AI provider directly for Quick Runs, and fine-tune temperature, top-p, and more right from the Arena and Experiments panel. No config files, no detours. Quick on the draw.
- Error Bars in Observatory - Metrics now show confidence intervals so you know how much to trust the numbers. Finally, some margin for error.
- Progress Bars for Risk Assessments - Red teaming jobs now show real-time progress instead of a spinner. Watch the risk assessment unfold. Progress has been made.
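HMAC signing, as in the auth support above, generally means computing a keyed hash over the raw request body and sending it in a header the server can verify. A minimal sketch using Python's standard library; the header name and scheme are illustrative assumptions, not Confident AI's wire format:

```python
import hashlib
import hmac

def sign_request(secret: bytes, body: bytes) -> str:
    """Compute an HMAC-SHA256 signature over the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify_request(secret: bytes, body: bytes, signature: str) -> bool:
    """Constant-time comparison to avoid timing attacks."""
    expected = sign_request(secret, body)
    return hmac.compare_digest(expected, signature)

secret = b"shared-secret"
body = b'{"input": "Hello"}'
sig = sign_request(secret, body)  # e.g. sent as an X-Signature header
```

Because the signature covers the exact bytes of the body, any tampering in transit invalidates it, and `compare_digest` keeps the check timing-safe.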
Changed
- Transformers & Categories out of Beta - Battle-tested and production-ready. No more beta disclaimers—officially official.
- User Analytics Upgrades - Total cost per user in the table, User ID filter on the Threads page, and click-through from Users to Traces. Your users, accounted for.
- New Pagination & Arrow Navigation - Smoother pagination across the platform and arrow-key navigation for Spans and Threads. Keyboard warriors, we’re turning the page for you.
- Framework Deletion - You can now delete frameworks you no longer need. Sometimes you just need to let go.
- General Stability & Performance Improvements - Bug fixes, reliability boosts, and the usual behind-the-scenes polish. The kind of changes you feel more than you see.
Version Control Freak
TGIF! Thank god it’s features, here’s what we shipped this week:
Datasets just got serious with Dataset Versioning—every change tracked, every version referenceable, no more “which dataset did we eval against?” Meanwhile, Replay Trace in Arena lets you re-run any production trace through Arena to compare models side by side on real traffic. And for the compliance-minded: Audit Logs are here.

Added
- Dataset Versioning - Datasets now have full version history. Every edit, every addition tracked—so you always know exactly what you evaluated against. No more version of events that doesn’t add up.
- Replay Trace in Arena - Take any production trace and replay it in Arena. Compare how different models handle the same real-world input, side by side. It’s the replay value you’ve been waiting for.
- Audit Logs - Full visibility into who did what, and when. Every action logged, every change accounted for. Your compliance team just breathed a sigh of relief.
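Under the hood, dataset versioning schemes commonly come down to snapshotting content and deriving a stable identifier for each snapshot. A content-hash sketch of that idea in Python; purely illustrative, not how Confident AI stores versions:

```python
import hashlib
import json

def dataset_version(records: list[dict]) -> str:
    """Derive a deterministic version ID from dataset contents.

    Sorting keys makes the hash independent of dict insertion order.
    """
    canonical = json.dumps(records, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

v1 = dataset_version([{"input": "Hi", "expected": "Hello!"}])
v2 = dataset_version([{"input": "Hi", "expected": "Hello there!"}])
```

Any edit or addition changes the hash, so an eval run can record exactly which version it ran against.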
MC…What?!!
TGIF! Thank god it’s features, here’s what we shipped this week:
Headline first: Confident AI now has an MCP server (open-sourced on GitHub)—plug your evals, datasets, and traces into any MCP-compatible client. Also shipping this week: automatic dataset curation from production traces, a wave of Observatory upgrades (custom column variable mapping, annotation tabs, category filters, metric columns), and PagerDuty for alerts.

Added
- MCP Server - Plug Confident AI into any MCP-compatible client. Your evals, datasets, and traces—accessible from wherever you already work. The model context protocol is served.
- Automatic Dataset Curation from Traces & Spans - The big one. Turn production traces and spans into curated datasets automatically. Your best (and worst) real-world examples, ready for eval—no manual curation required. Let your data curate itself.
- Annotation Tabs in Observatory - Annotations now live in their own tabs, so you can flip between views without losing context. We’re keeping tabs on your feedback.
- PagerDuty Integration for Alerts - Route alerts straight to PagerDuty so the right people get paged at the right time. On-call never looked so connected.
- Custom Column Variable Mapping - Map variables directly to custom columns in Observatory. Your data, your layout—no more squinting at mismatched fields. Finally, everything maps out.
- Category Filters & Metric/Annotation Column Options - Filter by category and toggle metric or annotation columns on and off. Observatory now lets you see exactly what matters—no more, no less. Filter out the noise.
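Automatic dataset curation from traces, as described above, usually amounts to filtering production records by some quality signal and reshaping them into eval-ready rows. A hand-rolled sketch of the pattern; the field names here are assumptions, not the platform's schema:

```python
def curate_dataset(traces: list[dict], max_score: float) -> list[dict]:
    """Keep low-scoring traces: the failures make the best regression tests."""
    return [
        {"input": t["input"], "actual_output": t["output"]}
        for t in traces
        if t["score"] <= max_score
    ]

traces = [
    {"input": "Refund policy?", "output": "30 days.", "score": 0.9},
    {"input": "Cancel my order", "output": "I like turtles.", "score": 0.2},
]
dataset = curate_dataset(traces, max_score=0.5)
```

The same filter works in reverse (keep only high scorers) when you want golden examples rather than failure cases.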
Changed
- General Stability & Performance Improvements - Faster loads, fewer hiccups, smoother everything. The kind of changes you feel more than you see.
Prompt-ly Evaluated
TGIF! Thank god it’s features, here’s what we shipped this week:
Headline first: Prompt Evals are here. Think GitHub Actions, but for prompt commits and version releases—so every prompt change can trigger the checks that keep quality high and surprises low.

Added
- Prompt Evals - The big one. Run evals on prompt commits and version releases automatically—CI for prompts, not vibes-based QA.
- Support Ticket Submission Page - Need help? There’s now a dedicated place to ask for it.
- Trace Classification - Sort your traces into categories. Less chaos, more class.
- Dataset Threads + Scenario Generation - Datasets now support threads, with scenario generation to spin up richer test cases.
- Portkey Support in Arena and Experiments - Portkey now works in Arena and Experiments. The key to connected workflows.
- SSE + HTTP Streaming for AI Connections - Stream responses over SSE or HTTP. Go with the flow.
- Azure Key Vault Integration - Store secrets in Azure Key Vault. Your keys, under lock and cloud.
- Org Settings Pages - New pages for roles, permissions, and API keys. Access control, finally under control.
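Server-Sent Events, as in the streaming support above, is a plain-text protocol: each event is a run of `data:` lines terminated by a blank line. A minimal parser sketch in Python, independent of any Confident AI client code:

```python
def parse_sse(stream: str) -> list[str]:
    """Collect the data payloads from a raw SSE stream.

    Per the SSE spec, consecutive `data:` lines within one event
    are joined with newlines, and a blank line ends the event.
    """
    events, buffer = [], []
    for line in stream.splitlines():
        if line.startswith("data:"):
            buffer.append(line[5:].lstrip())
        elif line == "" and buffer:
            events.append("\n".join(buffer))
            buffer = []
    if buffer:  # stream ended without a trailing blank line
        events.append("\n".join(buffer))
    return events

raw = "data: Hello\n\ndata: wor\ndata: ld\n\n"
```

In practice you'd feed chunks from the HTTP response into this incrementally, but the framing rules are the same.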
Changed
- Evaluate Buttons on Traces and Spans - Trigger evals directly from where issues appear, so troubleshooting is fewer clicks and more signal.
We Need to Talk. In Code.
TGIF! Thank god it’s features, here’s what we shipped this week:
Big week for the org-anized among us. Multi-turn evals go code-first, Vercel joins the family, and prompts finally get the observability they deserve.

Added
- Code-Based Multi-Turn Evals - Introducing `ConversationalTestCase` for your codebase. All the power of multi-turn evaluation, now programmable. Time to have the talk with your chatbot—in code.
- Vercel AI SDK Integration - Next.js devs, rejoice! Native integration with Vercel’s AI SDK means you can trace and evaluate your `ai` package calls with zero friction. Ship fast, eval faster.
- Transformers on Retrievers & Tools - Transformers aren’t just for AI connection outputs anymore. Reshape retriever outputs and tool calls before evaluation. Your agentic RAG pipeline called—it wants its custom parsing back.
- Organization-Wide Metrics - Define metrics at the org level and share them across all your teams. No more “wait, which faithfulness config are we using?” Standardize once, evaluate everywhere.
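The shape of a code-based multi-turn test case is essentially a list of role-tagged turns. A library-free sketch of that structure; this is illustrative only, so consult the `ConversationalTestCase` docs for the actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str

@dataclass
class MultiTurnTestCase:
    """A stand-in for a ConversationalTestCase-style object."""
    turns: list[Turn] = field(default_factory=list)

    def last_assistant_turn(self) -> str:
        for turn in reversed(self.turns):
            if turn.role == "assistant":
                return turn.content
        raise ValueError("no assistant turn yet")

case = MultiTurnTestCase(turns=[
    Turn("user", "Where is my order?"),
    Turn("assistant", "It shipped yesterday and arrives Friday."),
])
```

Multi-turn metrics then score properties of the whole conversation (coherence, role adherence) rather than a single input/output pair.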
Changed
- Prompt Observability - Track which prompts are running in production, when they were swapped, and how performance changed. Finally, prompt feedback on your prompts.
More Than Meets the AI
TGIF! Thank god it’s features, here’s what we shipped this week:
Transformers (Beta) are here and they’re truly more than meets the AI. Reshape your traced data before evaluation—because not every trace deserves the full spotlight. Meanwhile, Prompt Studio just got a serious commit-ment upgrade with git-style versioning. Love is in the diff this Valentine’s weekend.

Added
- Transformers (Beta) - The biggest release this week, and it’s more than meets the eye. Write custom code to transform your traced data—including individual spans—before evaluation. Don’t want the whole trace? No problem. Cherry-pick exactly what matters.
- Transformers on AI Connections - Got a JSON blob coming back from your model? Negative indexes on a list? Transformers let you parse and wrangle AI connection outputs however you need. Your data, your rules.
- Prompt Commits - Every change to your prompt now creates a commit. Full history, no more guessing what changed or when. It’s `git log` for your prompts, and it’s beautiful.
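A transformer in this sense is just a function from raw connection output to the value you actually want evaluated. Here's a sketch covering the two cases called out above, a JSON blob and a negative list index; the transformer signature is an assumption, not the Transformers API:

```python
import json

def transform(raw_output: str) -> str:
    """Pull the final message text out of a JSON response blob."""
    payload = json.loads(raw_output)
    # Negative index: grab the last choice, however many came back.
    return payload["choices"][-1]["text"].strip()

raw = '{"choices": [{"text": "draft"}, {"text": "  Final answer.  "}]}'
```

Whatever the function returns is what the metric sees, so evals score the answer itself rather than the wrapper around it.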
Changed
- Git-Based Prompt Studio - Prompt Studio is leaning hard into the git workflow. Commits, versions, diffs—everything you love about version control, now for your prompts. We’re committing to this direction. (Pun intended.)
Let There Be Light (Mode)
TGIF! Thank god it’s features, here’s what we shipped this week:
Big week for visibility—both in your data and on your screen. We’re launching 30+ additional Observatory graphs to surface insights, a Data Usage settings page for full transparency, and light mode is officially out of beta. Shine bright, friends.

Added
- Data Usage Settings Page - Know thy data. A dedicated page to see exactly how your data is being used—because transparency isn’t just a buzzword, it’s a lifestyle.
- Observatory Graphs - Finally, charts that slap. Visualize your observability data, spot trends before they spot you, and look like a genius in your next standup.
- Code Evals (Beta) - G-Eval couldn’t cut it? Write your own eval logic in code. We don’t judge. Okay, technically we do—that’s the whole point.
- Multimodal Arena - Let your vision-language models duke it out. Two models enter, one model leaves with bragging rights.
- AI Connection Upgrades - Tracing, list-index key paths, duplicate connections, max concurrency—the works. Your AI connections just got a glow-up.
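A code eval, as opposed to a G-Eval rubric, is just a function from a test case to a score. A minimal sketch of what such a metric might look like; the names here are illustrative, not the actual Code Evals interface:

```python
import re

def citation_metric(actual_output: str, min_citations: int = 1) -> float:
    """Score 1.0 if the output includes at least `min_citations`
    [n]-style citations, scaling down proportionally below that."""
    citations = len(re.findall(r"\[\d+\]", actual_output))
    return min(citations / min_citations, 1.0)
```

Deterministic checks like this are cheap, reproducible, and a good complement to LLM-judged metrics when the criterion is mechanical.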
Changed
- Light Mode Out of Beta - Light mode is officially here to stay. Welcome to the bright side.
- Faster Observatory Dashboards - We gave our dashboards a double espresso. Load times are now unreasonably fast.
Scaling New Heights
TGIF! Thank god it’s features, here’s what we shipped this week:
Welcome to our brand new changelog! We’re kicking things off with better cost tracking, reliability improvements, and some serious scalability upgrades.

Changelogs before this point are backfilled!
Added
- Changelog - You’re reading it! Subscribe to never miss a beat.
- Custom Model Costs - Set custom cost-per-token for any model in your project settings. Finally, accurate cost tracking for fine-tuned and self-hosted models.
- Request Timeout for AI Connections - Configure timeout limits for your LLM connections. No more hanging requests.
- High-Volume Trace Ingestion - We’ve beefed up our trace handling with buffered ingestion. Traffic spikes? Bring ’em on.
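Custom model costs plug into a standard formula: input and output tokens priced separately, usually quoted per million tokens. A sketch of the arithmetic; the per-million convention is an assumption about how the prices are quoted:

```python
def trace_cost(input_tokens: int, output_tokens: int,
               input_cost_per_m: float, output_cost_per_m: float) -> float:
    """Total USD cost of one LLM call given per-million-token prices."""
    return (input_tokens * input_cost_per_m
            + output_tokens * output_cost_per_m) / 1_000_000

# A fine-tuned model priced at $3/M input and $12/M output tokens:
cost = trace_cost(input_tokens=2_000, output_tokens=500,
                  input_cost_per_m=3.0, output_cost_per_m=12.0)
```

With self-hosted models you'd typically set these to your amortized serving cost rather than a vendor price.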
Changed
- Smoother Experiment Runs - Real-time evaluation progress is now more reliable with improved streaming.
- Annotator Attribution - See who left that annotation. Credit where credit’s due.
- Faster Spans Loading - The spans tab now loads at lightning speed, even for trace-heavy projects.
Alert the Press, We’re Going Multimodal
TGIF! Thank god it’s features, here’s what we shipped this week:
Big week! We’re introducing alerts to keep you in the loop, shareable traces for collaboration, and multimodal support so your vision models don’t feel left out.

Added
- Public Trace Links - Share traces with anyone via a public link. Perfect for debugging with teammates or showing off to stakeholders.
- Scheduled Alerts - Set thresholds, get notified. Never let a regression slip through unnoticed again.
- Multimodal Evaluations - Images + text? We can evaluate that now. Test your vision-language models with confidence.
- Evaluation Queue - Large eval jobs now queue up nicely instead of timing out. Go big or go home.
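Scheduled alerts of this kind come down to: compute a metric over a recent window, compare against a threshold, fire if breached. A minimal sketch; the field names and "below threshold fires" direction are assumptions:

```python
def check_alert(scores: list[float], threshold: float) -> bool:
    """Fire when the windowed average drops below the threshold."""
    if not scores:
        return False  # nothing to alert on yet
    return sum(scores) / len(scores) < threshold

recent = [0.9, 0.85, 0.4, 0.3]  # e.g. last hour of faithfulness scores
fired = check_alert(recent, threshold=0.7)  # avg 0.6125, so this fires
```

Averaging over a window rather than alerting on single traces keeps one bad output from paging the on-call at 3 a.m.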
Changed
- Snappier Dashboards - Graphs load faster. Like, noticeably faster. You’re welcome.