Diagnosing the Root Problems in LLM Apps: Relari and the continuous-eval Framework
Summary
Systematically diagnosing errors in the complex pipelines of applications built on large language models (LLMs) is a core challenge. Relari tackles this with 'continuous-eval', an evaluation framework that measures the performance of each component of a GenAI architecture such as a RAG (Retrieval-Augmented Generation) system individually (e.g., query classifier, retriever, reranker). When a system-level failure occurs, this makes it possible to pinpoint exactly which module is at fault and improve it.
Key Points
- Relari's continuous-eval tests GenAI pipelines at the component level, diagnosing the root cause of failures in complex RAG systems.
- It provides more than 30 metrics covering a wide range of modules, including retrieval, text generation, code generation, and agent tool use.
- Metrics trained on user feedback data predict thumbs-up/thumbs-down ratings for LLM answers, showing 90% agreement with actual user ratings.
- For test datasets that are difficult to build by hand, a synthetic data generation pipeline is provided so teams can get started quickly.
Launch HN: Relari (YC W24) – Identify the root cause of problems in LLM apps
Hi HN, we are the founders of Relari, the company behind continuous-eval (https://github.com/relari-ai/continuous-eval), an evaluation framework that lets you test your GenAI systems at the component level, pinpointing issues where they originate.
We experienced the need for this when we were building a copilot for bankers. Our RAG pipeline blew up in complexity as we added components: a query classifier (to triage user intent), multiple retrievers (to grab information from different sources), a filtering LLM (to rerank / compress context), a calculator agent (to call financial functions), and finally the synthesizer LLM that gives the answer. Ensuring reliability became more difficult with each component we added.
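To make those module boundaries concrete, here is a minimal, hypothetical sketch of such a chained pipeline in Python. The component names mirror the ones listed above, but the implementations are placeholders rather than Relari's actual code; the key point is that each module's output is recorded so it can later be scored on its own.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

@dataclass
class PipelineStep:
    """A named component whose output feeds the next step."""
    name: str
    run: Callable[[Any], Any]

def run_pipeline(steps: List[PipelineStep], query: str) -> Tuple[Any, Dict[str, Any]]:
    """Run components in order, recording each module's output for later evaluation."""
    trace: Dict[str, Any] = {}
    value: Any = query
    for step in steps:
        value = step.run(value)
        trace[step.name] = value  # per-module outputs are what component-level metrics score
    return value, trace

# Placeholder components standing in for the ones described above.
steps = [
    PipelineStep("query_classifier", lambda q: {"query": q, "intent": "trend_analysis"}),
    PipelineStep("retriever", lambda x: {**x, "chunks": ["chunk_a", "chunk_b", "chunk_c"]}),
    PipelineStep("reranker", lambda x: {**x, "chunks": x["chunks"][:2]}),
    PipelineStep("synthesizer", lambda x: f"Answer grounded in {x['chunks']}"),
]

answer, trace = run_pipeline(steps, "How did revenues trend over the last four quarters?")
```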
When a bad response was detected by our answer evaluator, we had to backtrack multiple steps to understand which component(s) made a mistake. But this quickly became unscalable beyond a few samples.
I did my Ph.D. in fault detection for autonomous vehicles, and I see a strong parallel between the complexity of autonomous driving software and today's LLM pipelines. In self-driving systems, sensors, perception, prediction, planning, and control modules are all chained together. To ensure system-level safety, we use granular metrics to measure the performance of each module individually. When the vehicle makes an unexpected decision, we use these metrics to pinpoint the problem to a specific component. Only then can we make targeted improvements, systematically.
Based on this thinking, we developed the first version of continuous-eval for ourselves. Since then we’ve made it more flexible to fit various types of GenAI pipelines. Continuous-eval allows you to describe (programmatically) your pipeline and modules, and select metrics for each module. We developed 30+ metrics to cover retrieval, text generation, code generation, classification, agent tool use, etc. We now have a number of companies using us to test complex pipelines like finance copilots, enterprise search, coding agents, etc.
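The exact continuous-eval API is documented in the repo linked above; the sketch below only illustrates the underlying idea in plain Python, with metric names and signatures chosen for this example: register a set of metric functions per module, then score that module's recorded outputs against ground truth.

```python
from typing import Callable, Dict, List

def context_recall(retrieved: List[str], ground_truth: List[str]) -> float:
    """Fraction of ground-truth chunks the module actually returned."""
    if not ground_truth:
        return 1.0
    return len(set(retrieved) & set(ground_truth)) / len(ground_truth)

def exact_match(predicted: str, expected: str) -> float:
    """1.0 if the generated answer matches the reference exactly, else 0.0."""
    return float(predicted.strip() == expected.strip())

# Per-module metric registry: each component is evaluated independently.
module_metrics: Dict[str, List[Callable]] = {
    "retriever": [context_recall],
    "reranker": [context_recall],
    "synthesizer": [exact_match],
}

def evaluate_module(name: str, outputs: list, references: list) -> Dict[str, float]:
    """Average every metric registered for one module over a labeled dataset."""
    return {
        metric.__name__: sum(metric(o, r) for o, r in zip(outputs, references)) / len(outputs)
        for metric in module_metrics[name]
    }
```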
As an example, one customer was trying to understand why their RAG system did poorly on trend-analysis queries. Through continuous-eval, they realized that the “retriever” component was retrieving 80%+ of all relevant chunks, but the “reranker” component, which filters out “irrelevant” context, was dropping that to below 50%. This enabled them to fix the problem, in their case by skipping the reranker for certain queries.
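With made-up chunk IDs and scores (not the customer's data), the per-module recall comparison that surfaces this kind of regression might look like the following, using the same toy recall metric as the sketch above:

```python
ground_truth = ["c1", "c2", "c3", "c4", "c5"]   # chunks a complete answer needs
retriever_out = ["c1", "c2", "c3", "c4", "x9"]  # hypothetical retriever output
reranker_out = ["c1", "c2"]                     # hypothetical reranker output

def context_recall(retrieved, truth):
    return len(set(retrieved) & set(truth)) / len(truth)

print(context_recall(retriever_out, ground_truth))  # 0.8 -> the retriever looks healthy
print(context_recall(reranker_out, ground_truth))   # 0.4 -> the reranker is discarding relevant chunks
```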
We’ve also built ensemble metrics that do a surprisingly good job of predicting user feedback. Users often rate LLM-generated answers by giving a thumbs up/down about how good the answer was. We train our custom metrics on this user data, and then use those metrics to generate thumbs up/down ratings on future LLM answers. The results turn out to be 90% aligned with what the users say. This gives developers a feedback loop from production data to offline testing and development. Some customers have found this to be our most unique advantage.
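Relari has not published the internals of these ensemble metrics, so the following is only a generic sketch of the pattern, assuming scikit-learn, a few toy metric scores, and fabricated thumbs-up/down labels: fit a simple classifier that maps per-answer metric scores to the user's rating, then apply it to new answers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: each row is one answered query, with scores from
# several offline metrics (e.g. context recall, faithfulness, answer relevance),
# and y is the thumbs-up (1) / thumbs-down (0) rating collected in production.
X = np.array([
    [0.9, 0.8, 0.7],
    [0.4, 0.3, 0.5],
    [0.8, 0.9, 0.6],
    [0.2, 0.4, 0.3],
])
y = np.array([1, 0, 1, 0])

# Combine the individual metrics into a single learned feedback predictor.
predictor = LogisticRegression().fit(X, y)

new_answer_scores = np.array([[0.85, 0.75, 0.80]])
print(predictor.predict(new_answer_scores))        # predicted thumbs up/down
print(predictor.predict_proba(new_answer_scores))  # confidence of the prediction
```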
Lastly, to make the most out of evaluation, you should use a diverse dataset—ideally with ground truth labels for comprehensive and consistent assessment. Because ground truth labels are costly and time-consuming to curate manually, we also have a synthetic data generation pipeline that allows you to get started quickly. Try it here (https://www.relari.ai/#synthetic_data_demo).
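The demo linked above is the way to try Relari's generator; purely for illustration, a bare-bones synthetic-data loop might look like the sketch below, where llm() is a stand-in for whatever model call you use and the prompt and JSON format are assumptions, not Relari's pipeline.

```python
import json
from typing import Callable, Dict, List

def generate_synthetic_examples(chunks: List[str], llm: Callable[[str], str]) -> List[Dict]:
    """For each document chunk, ask an LLM to write a question that chunk answers,
    producing (question, ground-truth answer, ground-truth context) triples."""
    dataset = []
    for chunk in chunks:
        prompt = (
            "Write one question that can be answered using only the passage below, "
            "then answer it. Respond with JSON containing 'question' and 'answer'.\n\n"
            f"Passage:\n{chunk}"
        )
        record = json.loads(llm(prompt))           # llm() is a placeholder for your model client
        record["ground_truth_context"] = [chunk]   # the source chunk doubles as the retrieval label
        dataset.append(record)
    return dataset
```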
What’s been your experience testing and iterating LLM apps? Please let us know your thoughts and feedback on our approaches (modular framework, leveraging user feedback, testing with synthetic data).
AI-Generated Content
This content is an automated AI summary, translation, and analysis of the original HN AI Engineering post. Copyright remains with the original author; please consult the original for the authoritative text.