Reaching for a vector database first when building RAG means solving the wrong problem first. For most domain-specific corpora (technical documentation, enterprise knowledge bases, article archives), BM25 retrieval performs competitively with semantic search at a fraction of the compute cost and with dramatically simpler operations. This tutorial builds a complete RAG pipeline with Meilisearch as the retrieval backend, streams responses from an LLM API, and shows how to evaluate hit rate, all without using a single embedding model.

Why RAG, and why not a vector database

Retrieval-Augmented Generation (RAG) solves a fundamental problem: LLMs have a knowledge cutoff and a finite context window. You want answers grounded in your documents, not hallucinated from pre-training. The standard advice is to use a vector database (Pinecone, Weaviate, Chroma). Vector search is powerful for open-domain retrieval where semantic similarity matters. But for domain-specific corpora with consistent terminology, such as a cybersecurity knowledge base or a medical reference, BM25 with typo tolerance typically reaches 85–95% of the recall you would get from embeddings, with zero GPU cost, sub-10ms latency, and no embedding pipeline to maintain. Meilisearch gives you BM25-style keyword ranking out of the box (strictly speaking, it applies a cascade of ranking rules rather than the BM25 formula itself, but it fills the same lexical-retrieval role), plus typo tolerance, faceted filtering, and a simple REST API. It is what I use to power search across 1,600+ articles at AYI NEDJIMI Consultants.

Setup

pip install meilisearch openai httpx

Run Meilisearch locally:
docker run -d -p 7700:7700 getmeili/meilisearch:latest

Step 1: Index your documents

Documents need an id, searchable content, and any filter attributes you want to use at query time.
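As a rough illustration of that document shape, the sketch below builds one record and serializes it as a JSONL line. The helper name `make_doc` and the sample values are hypothetical and not part of the original pipeline; the ID scheme mirrors the SHA-256 fallback used in the indexing code.

```python
import hashlib
import json


def make_doc(title: str, content: str, tags: list[str],
             category: str, doc_type: str) -> dict:
    """Build one document record (hypothetical helper, for illustration)."""
    # Derive a stable ID from the content so that re-indexing the same text
    # overwrites the existing record instead of creating a duplicate.
    doc_id = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
    return {
        "id": doc_id,
        "title": title,
        "content": content,
        "tags": tags,
        "category": category,
        "doc_type": doc_type,
    }


doc = make_doc(
    title="Sample guide",
    content="Sample body text for a security guide.",
    tags=["sample", "compliance"],
    category="security",
    doc_type="guide",
)
jsonl_line = json.dumps(doc, ensure_ascii=False)  # one JSON object per line
print(jsonl_line)
```

Hashing the content keeps IDs deterministic, which matters because documents sharing an ID replace each other in the index.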

import meilisearch
import hashlib
import json

MEILI_URL = "http://127.0.0.1:7700"
MEILI_KEY = "your_master_key"  # or "" for local dev
INDEX_NAME = "knowledge_base"

client = meilisearch.Client(MEILI_URL, MEILI_KEY)

def get_or_create_index():
    try:
        index = client.get_index(INDEX_NAME)
    except meilisearch.errors.MeilisearchApiError:
        # The index does not exist yet: create it and wait for the task to finish
        task = client.create_index(INDEX_NAME, {"primaryKey": "id"})
        client.wait_for_task(task.task_uid)
        index = client.get_index(INDEX_NAME)

    # Configure searchable attributes and filters
    index.update_settings({
        "searchableAttributes": ["title", "content", "tags"],
        "filterableAttributes": ["category", "doc_type"],
        "rankingRules": ["words", "typo", "proximity", "attribute", "sort", "exactness"],
        "typoTolerance": {
            "enabled": True,
            "minWordSizeForTypos": {"oneTypo": 4, "twoTypos": 8},
        },
    })
    return index

def index_documents(documents: list[dict]):
    """
    Each document: {"id": str, "title": str, "content": str,
                    "tags": list[str], "category": str, "doc_type": str}
    """
    index = get_or_create_index()
    # Add a stable ID to any document that does not have one
    for doc in documents:
        if "id" not in doc:
            doc["id"] = hashlib.sha256(doc["content"].encode()).hexdigest()[:16]
    task = index.add_documents(documents, primary_key="id")
    client.wait_for_task(task.task_uid)
    print(f"Indexed {len(documents)} documents.")

# Example: load from a JSONL file
def load_and_index(filepath: str):
    docs = []
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                docs.append(json.loads(line))
    index_documents(docs)

Step 2: Retrieve top-k documents

def retrieve(query: str, top_k: int = 5, filters: str = "") -> list[dict]:
    """
    Return the top-k documents matching the query.
    Filter example: "category = 'security' AND doc_type = 'guide'"
    """
    index = client.get_index(INDEX_NAME)
    search_params = {
        "limit": top_k,
        "attributesToRetrieve": ["id", "title", "content", "category"],
        "attributesToHighlight": ["content"],
        "highlightPreTag": "",
        "highlightPostTag": "",
    }
    if filters:
        search_params["filter"] = filters
    results = index.search(query, search_params)
    return results["hits"]
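The `filters` argument is a plain string, so anything built from user input should be quoted carefully. Below is a minimal sketch, assuming equality-only conditions joined with AND. `build_filter` is a hypothetical helper, and the backslash escaping of quotes reflects my understanding of Meilisearch's filter syntax, so verify it against your server version.

```python
def build_filter(conditions: dict[str, str]) -> str:
    """Join attribute/value pairs into a filter expression (hypothetical helper)."""
    parts = []
    for attribute, value in conditions.items():
        # Escape backslashes first, then single quotes, so the value cannot
        # terminate the quoted string early.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        parts.append(f"{attribute} = '{escaped}'")
    return " AND ".join(parts)


print(build_filter({"category": "security", "doc_type": "guide"}))
# category = 'security' AND doc_type = 'guide'
```

An empty dict yields an empty string, which the `if filters:` check in `retrieve` treats as "no filter".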

Step 3: Construct the prompt
Prompt structure matters a great deal. You want the model to be explicitly grounded: it should cite only what is in the retrieved chunks and not hallucinate.

def build_prompt(query: str, retrieved_docs: list[dict]) -> list[dict]:
    # Number each source so the model can cite it as [Source N]
    context_blocks = []
    for i, doc in enumerate(retrieved_docs, 1):
        # Truncate each document to 1,200 characters to bound the prompt size
        context_blocks.append(f"[Source {i}] {doc['title']}\n{doc['content'][:1200]}")

    context = "\n\n---\n\n".join(context_blocks)

    system_prompt = (
        "You are a technical assistant. Answer the user's question using ONLY "
        "the provided sources. If the answer is not in the sources, say so explicitly. "
        "Cite sources by number, e.g. [Source 1]."
    )

    user_message = f"Sources:\n\n{context}\n\n---\n\nQuestion: {query}"

    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]
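The 1,200-character slice per source keeps the prompt bounded. The sketch below makes that budget explicit. `pack_sources`, the `budget_chars` parameter, and the ratio of 4 characters per token are assumptions for illustration; the ratio is a common rule of thumb for English text, not an exact tokenizer count.

```python
SEPARATOR = "\n\n---\n\n"


def pack_sources(docs: list[dict], max_chars_per_doc: int = 1200,
                 budget_chars: int = 6000) -> tuple[str, int]:
    """Return (context, estimated_tokens), keeping context within budget_chars."""
    blocks = []
    used = 0
    for i, doc in enumerate(docs, 1):
        body = doc["content"][:max_chars_per_doc]
        block = f"[Source {i}] {doc['title']}\n{body}"
        # Count the separator too, so `used` equals the final context length.
        cost = len(block) + (len(SEPARATOR) if blocks else 0)
        if used + cost > budget_chars:
            break  # stop adding sources once the budget would be exceeded
        blocks.append(block)
        used += cost
    context = SEPARATOR.join(blocks)
    # Rough estimate: about 4 characters per token (assumption).
    return context, len(context) // 4


docs = [
    {"title": "Doc A", "content": "a" * 3000},
    {"title": "Doc B", "content": "b" * 3000},
]
context, est_tokens = pack_sources(docs, budget_chars=2000)
print(len(context), est_tokens)
```

With a 2,000-character budget only the first source fits, which is the behavior you want when a smaller model window forces a lower `top_k`.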

Step 4: Stream the LLM response
Do not buffer the full response before sending it to the user. For long answers, streaming is essential for UX.

from openai import OpenAI

# Generic llm_client: swap in any compatible SDK
llm_client = OpenAI(
    api_key="your_api_key",
    base_url="https://api.your-llm-provider.com/v1",  # adjust for your provider
)

def rag_stream(query: str, category_filter: str = ""):
    """Generator that yields text chunks as they arrive from the LLM."""
    filters = f"category = '{category_filter}'" if category_filter else ""
    docs = retrieve(query, top_k=5, filters=filters)

    if not docs:
        yield "No relevant documents found in the knowledge base."
        return

    messages = build_prompt(query, docs)

    stream = llm_client.chat.completions.create(
        model="gpt-4o-mini",  # or your preferred model
        messages=messages,
        stream=True,
        temperature=0.2,  # low temperature for factual retrieval tasks
        max_tokens=800,
    )

    for chunk in stream:
        # Some providers send chunks with no choices (e.g. usage-only chunks)
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        if delta.content:
            yield delta.content
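Because `rag_stream` is a plain generator, a caller can display chunks as they arrive and still keep the full answer for logging or checks. Here is a minimal sketch that uses a stand-in generator instead of a real LLM call. `fake_stream`, `consume`, and `cited_sources` are hypothetical names, and the citation check assumes the `[Source N]` format requested in the system prompt.

```python
import re
from typing import Iterable, Iterator


def fake_stream() -> Iterator[str]:
    """Stand-in for rag_stream(): yields text chunks like a streaming LLM."""
    yield from ["The answer is in ", "the first guide [Sour", "ce 1] ",
                "and the checklist [Source 3]."]


def consume(chunks: Iterable[str]) -> str:
    """Print chunks as they arrive and return the assembled answer."""
    parts = []
    for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()
    return "".join(parts)


def cited_sources(answer: str, num_sources: int) -> tuple[set[int], set[int]]:
    """Return (valid, invalid) source numbers cited as [Source N]."""
    cited = {int(n) for n in re.findall(r"\[Source (\d+)\]", answer)}
    valid = {n for n in cited if 1 <= n <= num_sources}
    return valid, cited - valid


answer = consume(fake_stream())
valid, invalid = cited_sources(answer, num_sources=2)
print(valid, invalid)
```

Checking citations on the assembled answer rather than on individual chunks matters because a marker such as `[Source 1]` can be split across two chunks, as in the stand-in stream above.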

Step 5: Wire it together with a minimal CLI

import sys

def main():
    # Take the query from the command line, or prompt for it interactively
    query = " ".join(sys.argv[1:]) if len(sys.argv) > 1 else input("Query: ")
    print(f"\nQuery: {query}\n{'=' * 60}\n")
    for token in rag_stream(query):
        print(token, end="", flush=True)
    print("\n")

if __name__ == "__main__":
    main()

Usage: python rag.py "What are the key requirements of NIS 2 for SMEs?"

Step 6: Evaluate hit rate
Before deploying, measure whether your retrieval is actually finding the right documents. You need a small golden dataset that maps each query to its expected document ID.

def evaluate_hit_rate(golden_set: list[dict], top_k: int = 5) -> float:
    """
    golden_set: [{"query": "...", "expected_id": "doc_id"}, ...]
    Returns hit rate @ top_k.
    """
    if not golden_set:
        raise ValueError("golden_set must not be empty")
    hits = 0
    for item in golden_set:
        results = retrieve(item["query"], top_k=top_k)
        retrieved_ids = {r["id"] for r in results}
        # A hit means the expected document appears anywhere in the top-k results
        if item["expected_id"] in retrieved_ids:
            hits += 1
    hit_rate = hits / len(golden_set)
    print(f"Hit rate @ {top_k}: {hit_rate:.2%} ({hits}/{len(golden_set)})")
    return hit_rate

# Example usage
golden = [
    {"query": "NIS 2 SME requirements", "expected_id": "nis2-guide-001"},
    {"query": "ISO 27001 certification steps", "expected_id": "iso27001-checklist"},
    {"query": "penetration testing methodology", "expected_id": "pentest-guide-002"},
]
evaluate_hit_rate(golden, top_k=5)
On a 1,600-article cybersecurity corpus, this setup achieves roughly 91% hit rate at k=5 — without a single embedding model call.
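To see how the metric behaves without a running Meilisearch instance, the sketch below computes the same hit rate against a stubbed retriever and sweeps two values of k. `hit_rate_at_k`, `stub_retrieve`, and the toy rankings are hypothetical; in practice you would pass the real `retrieve` function.

```python
from typing import Callable


def hit_rate_at_k(golden_set: list[dict],
                  retrieve_fn: Callable[[str, int], list[dict]],
                  top_k: int) -> float:
    """Fraction of queries whose expected document appears in the top-k results."""
    if not golden_set:
        raise ValueError("golden_set must not be empty")
    hits = 0
    for item in golden_set:
        retrieved_ids = {r["id"] for r in retrieve_fn(item["query"], top_k)}
        if item["expected_id"] in retrieved_ids:
            hits += 1
    return hits / len(golden_set)


# Toy rankings standing in for search results (illustrative only).
RANKINGS = {
    "q1": ["doc-a", "doc-b", "doc-c"],
    "q2": ["doc-x", "doc-y", "doc-b"],
}


def stub_retrieve(query: str, top_k: int) -> list[dict]:
    """Return the first top_k entries of the toy ranking for this query."""
    return [{"id": doc_id} for doc_id in RANKINGS.get(query, [])[:top_k]]


golden = [
    {"query": "q1", "expected_id": "doc-a"},
    {"query": "q2", "expected_id": "doc-b"},
]
for k in (1, 3):
    print(k, hit_rate_at_k(golden, stub_retrieve, k))
# 1 0.5
# 3 1.0
```

Sweeping k this way shows whether misses are documents ranked just outside the cutoff (fixable by raising `top_k` or re-ranking) or documents that are not retrieved at all.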
Production considerations
Chunking strategy: For long documents, chunk at 512–800 tokens with 10% overlap. Store doc_id and chunk_index so you can reconstruct the full document if needed.
Re-ranking: If your hit rate plateaus below 85%, add a lightweight cross-encoder re-ranker as a second stage. cross-encoder/ms-marco-MiniLM-L-6-v2 from Sentence Transformers runs locally and adds about 30ms of latency.
Context window budget: At 5 docs × 1,200 chars, you are using roughly 1,500 tokens of context. Adjust top_k and content truncation to stay within your model's window while leaving room for the answer.
Caching: Cache retrieval results for identical queries with a 5–15 minute TTL, using Redis or a simple in-memory dict. LLM calls can be cached longer for factual queries.

This pipeline (Meilisearch retrieval, prompt construction, streaming output) is what I actually run in production. There is no embedding pipeline and no vector database operational overhead. For domain-specific retrieval, BM25 is frequently the pragmatic choice. Consider semantic search only when query vocabulary genuinely diverges from document vocabulary. Otherwise, ship the simpler thing.
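The chunking guidance under production considerations (fixed-size chunks with 10% overlap, keeping doc_id and chunk_index) could be sketched as follows. This version approximates tokens by whitespace-separated words, which is an assumption made for simplicity; swap in a real tokenizer if you need exact counts. `chunk_document` is a hypothetical helper.

```python
def chunk_document(doc_id: str, text: str, chunk_size: int = 512,
                   overlap_ratio: float = 0.10) -> list[dict]:
    """Split text into overlapping chunks that can be indexed individually."""
    words = text.split()  # word count as a stand-in for token count (assumption)
    overlap = int(chunk_size * overlap_ratio)
    step = max(chunk_size - overlap, 1)
    chunks = []
    for chunk_index, start in enumerate(range(0, len(words), step)):
        window = words[start:start + chunk_size]
        chunks.append({
            "id": f"{doc_id}-{chunk_index}",
            "doc_id": doc_id,            # lets you regroup chunks by document
            "chunk_index": chunk_index,  # lets you restore the original order
            "content": " ".join(window),
        })
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


text = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_document("doc-001", text)
print(len(chunks), [len(c["content"].split()) for c in chunks])
# 3 [512, 512, 78]
```

Each chunk repeats the last 51 words of the previous one, so a sentence that straddles a boundary is still fully contained in at least one chunk.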
