Ragas: An Open-Source Toolkit for Evaluating LLM Applications
Summary
Ragas is a comprehensive open-source library designed to objectively and systematically evaluate and optimize the performance of applications built on large language models (LLMs). Moving beyond subjective, time-consuming evaluation practices, it provides precise, data-driven validation workflows using both LLM-based and traditional metrics. It also offers automatic test dataset generation, integrates seamlessly with popular frameworks such as LangChain, and helps developers continuously improve their applications using real production data.
Key Points
- Ragas specializes in objectively and precisely evaluating the performance of RAG systems using both LLM-based and traditional metrics.
- Even when no test dataset is available, it can automatically generate comprehensive, production-aligned test datasets.
- It integrates seamlessly with popular LLM frameworks such as LangChain, and the `ragas quickstart` command makes it easy to bootstrap an evaluation project.
- Custom metrics (DiscreteMetric) such as Aspect Critique make it possible to precisely validate specific aspects of an output, e.g. summary accuracy.
Supercharge Your LLM Application Evaluations 🚀
Objective metrics, intelligent test generation, and data-driven insights for LLM apps
Ragas is your ultimate toolkit for evaluating and optimizing Large Language Model (LLM) applications. Say goodbye to time-consuming, subjective assessments and hello to data-driven, efficient evaluation workflows. Don't have a test dataset ready? We also do production-aligned test set generation.
- 🎯 Objective Metrics: Evaluate your LLM applications with precision using both LLM-based and traditional metrics.
- 🧪 Test Data Generation: Automatically create comprehensive test datasets covering a wide range of scenarios.
- 🔗 Seamless Integrations: Works flawlessly with popular LLM frameworks like LangChain and major observability tools.
- 📊 Build feedback loops: Leverage production data to continually improve your LLM applications.
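As a hedged sketch of what an objective-metrics run can look like, the snippet below uses the 0.1-style evaluate() API with the built-in faithfulness and answer_relevancy metrics; the dataset rows are placeholders, and newer Ragas releases wrap data in an EvaluationDataset instead, so adapt it to your installed version:
# Minimal sketch, assuming the 0.1-style evaluate() API; all rows are placeholders.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Each row pairs a question and a generated answer with the retrieved contexts.
dataset = Dataset.from_dict({
    "question": ["When was the first Super Bowl?"],
    "answer": ["The first Super Bowl was held on January 15, 1967."],
    "contexts": [["The First AFL-NFL World Championship Game was played on January 15, 1967."]],
})

# Run the LLM-based metrics over every row and print the aggregate scores.
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print(result)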
Installation:
PyPI:
pip install ragas
Alternatively, from source:
pip install git+https://github.com/vibrantlabsai/ragas
The fastest way to get started is to use the ragas quickstart command:
# List available templates
ragas quickstart
# Create a RAG evaluation project
ragas quickstart rag_eval
# Specify where you want to create it.
ragas quickstart rag_eval -o ./my-project
Available templates:
rag_eval - Evaluate RAG systems
Coming Soon:
agent_evals - Evaluate AI agents
benchmark_llm - Benchmark and compare LLMs
prompt_evals - Evaluate prompt variations
workflow_eval - Evaluate complex workflows
Ragas comes with pre-built metrics for common evaluation tasks. For example, Aspect Critique evaluates any aspect of your output using DiscreteMetric:
import asyncio
from openai import AsyncOpenAI
from ragas.metrics import DiscreteMetric
from ragas.llms import llm_factory

# Setup your LLM
client = AsyncOpenAI()
llm = llm_factory("gpt-4o", client=client)

# Create a custom aspect evaluator
metric = DiscreteMetric(
    name="summary_accuracy",
    allowed_values=["accurate", "inaccurate"],
    prompt="""Evaluate if the summary is accurate and captures key information.
Response: {response}
Answer with only 'accurate' or 'inaccurate'.""",
)

# Score your application's output
async def main():
    score = await metric.ascore(
        llm=llm,
        response="The summary of the text is...",
    )
    print(f"Score: {score.value}")  # 'accurate' or 'inaccurate'
    print(f"Reason: {score.reason}")

if __name__ == "__main__":
    asyncio.run(main())
Note: Make sure your OPENAI_API_KEY environment variable is set.
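For example, in your shell (the key value is a placeholder):
export OPENAI_API_KEY="your-api-key"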
Find the complete Quickstart Guide in the documentation.
In the past two years, we have seen and helped improve many AI applications using evals. If you want help improving and scaling up your AI application with evals:
🔗 Book a slot or drop us a line: founders@vibrantlabs.com.
If you want to get more involved with Ragas, check out our Discord server. It's a fun community where we geek out about LLMs, retrieval, production issues, and more.
Developers: Those who build with ragas.
(You have `import ragas` somewhere in your project.)
Contributors: Those who make ragas better.
(You open PRs to this repo.)
We welcome contributions from the community! Whether it's bug fixes, feature additions, or documentation improvements, your input is valuable.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
At Ragas, we believe in transparency. We collect minimal, anonymized usage data to improve our product and guide our development efforts.
✅ No personal or company-identifying information
✅ Open-source data collection code
✅ Publicly available aggregated data
To opt out, set the RAGAS_DO_NOT_TRACK environment variable to true.
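For example, in your shell:
export RAGAS_DO_NOT_TRACK=true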
AI-Generated Content
This content is an AI-produced summary, translation, and analysis of the original HN AI Engineering post. Copyright remains with the original author; please refer to the original article for the authoritative version.