
User-Facing vs. Non-User-Facing Apps

If you are evaluating an LLM application, one of the first things to figure out is whether a human sees the output directly. A voice agent taking hotel reservations and a contract extraction pipeline can run on the same model, but they fail in completely different ways and need different evals.

User-facing applications fail on experience, not only on facts. Tone, sentiment, helpfulness, role adherence, and how the conversation feels sit beside correctness. A reply can be factual and still end the relationship. Your test cases need real mess:

  • Vague or ambiguous asks
  • Frustrated or emotional language
  • Mixed intents in a single message
  • People who do not use your product vocabulary

If your suite is only crisp Q&A with clean expected answers, you will pass evals and lose users.
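A hedged sketch of what "real mess" looks like as test cases, using plain Python dicts. The field names (`input`, `graded_on`) and grading dimensions are illustrative only, not any particular eval framework's schema:

```python
# Messy test cases for a user-facing assistant. Each one grades
# experience dimensions alongside (or instead of) bare correctness.
messy_cases = [
    {"input": "this STILL isn't working and I've asked twice",
     "graded_on": ["tone", "de-escalation", "correctness"]},      # frustrated user
    {"input": "can you fix the thing from before?",
     "graded_on": ["clarifying_question", "helpfulness"]},        # vague ask
    {"input": "cancel my order, also why was I charged twice?",
     "graded_on": ["both_intents_addressed", "correctness"]},     # mixed intents
    {"input": "the money-back thingy on my page",
     "graded_on": ["intent_recognition"]},                        # no product vocabulary
]

for case in messy_cases:
    # Correctness never stands alone for a user-facing app.
    assert case["graded_on"] != ["correctness"]
```

The point of the structure is the second field: a crisp-Q&A suite would grade every case on correctness alone.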

Non-user-facing applications are judged on task completion and accuracy. Did the summary contain the required points? Did extraction match the schema? Did the job finish? A "warm" JSON buys you nothing. The pushback is usually "we might add a UI later." Fine — when a human is in the loop, rebalance the metric set. Until then, optimize for the outcome the downstream system or analyst actually uses.
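For contrast, a non-user-facing check can be a blunt pass/fail on task completion. A minimal sketch, assuming a contract extraction job; the schema and field names here are invented for illustration:

```python
# Required fields for a hypothetical contract-extraction pipeline.
REQUIRED_FIELDS = {"party_a", "party_b", "effective_date", "termination_clause"}

def extraction_passes(output: dict) -> bool:
    """Task completion, not tone: every required field present and non-empty."""
    return all(output.get(field) not in (None, "") for field in REQUIRED_FIELDS)

complete = {"party_a": "Acme", "party_b": "Globex",
            "effective_date": "2024-01-01", "termination_clause": "Section 9"}
missing  = {"party_a": "Acme", "party_b": "", "effective_date": "2024-01-01"}

assert extraction_passes(complete)
assert not extraction_passes(missing)
```

Note there is no tone or sentiment dimension anywhere; a downstream system consuming this JSON only cares that the fields are there and correct.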

Which one are you building?

This is not just a metric question — it changes what you can observe in production. User-facing apps have signals that do not exist in automation:

  • User-facing: intent, sentiment, abandonment, repeat questions — patterns that only show up when a person is on the other end. Teams that skip these signals on a patient intake assistant or a financial advisor agent learn about quality breakdowns from complaints instead of catching them early.
  • Non-user-facing: operational failure modes, weird outputs, bad inputs. There is no "user frustration" to detect. If you wire user-intent classifiers to a recruiting screener that filters resumes, you are solving a problem you do not have.

Decide which world you are in early. It drives which metrics matter, which signals to configure, what your datasets need to contain, and what "drift" means on your dashboard. The model name on the box does not tell you — the presence or absence of a human on the other end does.

TL;DR — User-facing apps fail on experience; non-user-facing apps fail on accuracy. Same model, different eval strategy.