
© 2026 Molayo

HN Key Summary · 2026. 04. 24. 13:11

Confident AI Launches: An LLM Application Evaluation Platform Built on DeepEval

Summary

Confident AI is a cloud platform built on DeepEval, the open-source LLM testing framework. Where DeepEval stopped at simply running evaluations, Confident AI adds a "dataset editor," a "Regression Catcher," and "iteration insights" to maximize the developer experience. To overcome the limitations of the LLM-as-a-judge approach, it introduces a DAG (Directed Acyclic Graph) metric that provides deterministic evaluation results, and it supports RAG pipelines, agents, and chatbots.

Key Points

  • Built on DeepEval, Confident AI provides a cloud platform that handles over 600K LLM evaluations per day in enterprise environments.
  • Its core features are the "dataset editor," which lets domain experts manage datasets and analyze test results, and the "Regression Catcher."
  • To improve evaluation reliability, it introduces a DAG (Directed Acyclic Graph) metric that overcomes the limitations of LLM-as-a-judge and delivers deterministic benchmarks.
  • Users can easily compare combinations of LLM models and prompts and select the best implementation.

Hi HN - we're Jeffrey and Kritin, and we're building Confident AI (https://confident-ai.com). This is the cloud platform for DeepEval (https://github.com/confident-ai/deepeval), our open-source package that helps engineers evaluate and unit-test LLM applications. Think Pytest for LLMs.
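The "Pytest for LLMs" idea can be sketched in a few lines: each test case pairs an input with the application's actual output, and a metric scores that output against a threshold. This is an illustrative toy, not DeepEval's real API — the actual library uses LLM-as-a-judge metrics, whereas the metric here is a hypothetical deterministic keyword check so the sketch runs standalone:

```python
from dataclasses import dataclass

@dataclass
class LLMTestCase:
    input: str
    actual_output: str
    expected_keywords: list  # hypothetical field for this toy metric

def keyword_coverage_metric(case: LLMTestCase, threshold: float = 0.5) -> bool:
    """Toy deterministic metric: fraction of expected keywords found in the output."""
    hits = sum(k.lower() in case.actual_output.lower() for k in case.expected_keywords)
    return hits / len(case.expected_keywords) >= threshold

def assert_test(case: LLMTestCase) -> None:
    """Pytest-style assertion: fail the build if the metric falls below threshold."""
    assert keyword_coverage_metric(case), f"Test failed for input: {case.input!r}"

case = LLMTestCase(
    input="What does DeepEval do?",
    actual_output="DeepEval evaluates and unit-tests LLM applications.",
    expected_keywords=["evaluate", "unit-test", "LLM"],
)
assert_test(case)  # all three keywords appear, so the test passes
```

Because the assertion raises on failure, such tests drop straight into an existing Pytest suite and CI/CD pipeline, which is the workflow the post describes.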

We spent the past year building DeepEval with the goal of providing the best LLM evaluation developer experience, growing it to run over 600K evaluations daily in CI/CD pipelines of enterprises like BCG, AstraZeneca, AXA, and Capgemini. But the fact that DeepEval simply runs, and does nothing with the data afterward, isn’t the best experience. If you want to inspect failing test cases, identify regressions, or even pick the best model/prompt combination, you need more than just DeepEval. That’s why we built a platform around it.

Here’s a quick demo video of how everything works: https://youtu.be/PB3ngq7x4ko

Confident AI is great for RAG pipelines, agents, and chatbots. Typical use cases involve allowing companies to switch the underlying LLM, rewrite prompts for newer (and possibly cheaper) models, and keep test sets in sync with the codebase where DeepEval tests are run.

Our platform features a "dataset editor," a "regression catcher," and "iteration insights." The dataset editor in Confident AI allows domain experts to edit datasets while keeping them in sync with your codebase for evaluation. We'll then generate shareable LLM testing/benchmark reports once DeepEval has finished running evaluations on these datasets pulled from the cloud. The regression catcher then identifies any regressions in your new implementation, and we use these evaluation results to determine the best iteration based on your metric scores.
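At its core, a regression catcher compares per-test metric scores across two runs and flags any test whose score dropped. A minimal sketch with hypothetical test IDs and scores (the real platform tracks this across CI/CD runs rather than in-memory dicts):

```python
def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.0) -> list:
    """Return test IDs whose candidate score fell below the baseline score.

    A test missing from the candidate run counts as a regression (score 0.0).
    """
    return sorted(
        test_id
        for test_id, old_score in baseline.items()
        if candidate.get(test_id, 0.0) < old_score - tolerance
    )

# Hypothetical metric scores from two evaluation runs of the same dataset.
baseline_run  = {"summarize_001": 0.92, "rag_002": 0.81, "agent_003": 0.77}
candidate_run = {"summarize_001": 0.95, "rag_002": 0.64, "agent_003": 0.78}

print(find_regressions(baseline_run, candidate_run))  # ['rag_002']
```

The `tolerance` parameter is one way to absorb small score jitter from non-deterministic judges, which connects to the reliability concerns discussed below.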

Our goal is to make benchmarking LLM applications so reliable that picking the best implementation is as simple as reading the metric values off the dashboard. To achieve this, the quality of curated datasets and the accuracy and reliability of metrics must be the highest possible.

This brings us to our current limitations. Right now, DeepEval’s primary evaluation method is LLM-as-a-judge. We use techniques such as GEval and question-answer generation to improve reliability, but these methods can still be inconsistent. Even with high-quality datasets curated by domain experts, our evaluation metrics remain the biggest blocker to our goal.

To address this, we recently released a DAG (Directed Acyclic Graph) metric in DeepEval. It is a decision-tree-based, LLM-as-a-judge metric that provides deterministic results by breaking a test case into finer atomic units. Each edge represents a decision, each node represents an LLM evaluation step, and each leaf node returns a score. It works best in scenarios where success criteria are clearly defined, such as text summarization.
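The structure described above can be sketched as follows: internal nodes run a judge step, edges carry the judge's verdict, and leaves return the final score. The judges here are hypothetical deterministic checks standing in for LLM calls, and the summarization criteria ("covers key facts," "under a word limit") are invented for illustration:

```python
def node(judge, edges):
    """Internal DAG node: run the judge on the output, follow the matching edge."""
    def evaluate(output: str) -> float:
        branch = edges[judge(output)]
        # An edge leads either to another node (callable) or to a leaf score.
        return branch(output) if callable(branch) else branch
    return evaluate

# Stub judges: deterministic checks standing in for LLM evaluation steps.
covers_key_facts = lambda out: "revenue" in out and "profit" in out
under_word_limit = lambda out: len(out.split()) <= 20

# Leaves are plain scores; the tree composes bottom-up.
length_check = node(under_word_limit, {True: 1.0, False: 0.5})
summary_dag  = node(covers_key_facts, {True: length_check, False: 0.0})

good = "Revenue rose 10% and profit doubled."
print(summary_dag(good.lower()))  # 1.0: covers the facts and is concise
```

Because each atomic check is narrow, the judge's verdict at each node is far less ambiguous than asking one LLM to grade the whole output, which is where the determinism comes from.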

The DAG metric is still in its early stages, but our hope is that by moving towards better, code-driven, open-source metrics, Confident AI can deliver deterministic LLM benchmarks that anyone can blindly trust.

We hope you’ll give Confident AI a try. Quickstart here: https://docs.confident-ai.com/confident-ai/confident-ai-intr...
The platform runs on a freemium tier, and we've dropped the need to sign up with a work email for the next four days.

Looking forward to your thoughts!

AI-Generated Content

This content was automatically summarized, translated, and analyzed by AI from the original HN AI Engineering post. Copyright remains with the original author; please consult the original for accurate details.

Go to the original article