마이크로서비스를 사용하여 프로덕션 등급의 AI 에이전트 시스템을 구축합니다. Python을 사용한 FastAPI, gRPC, Kafka, Kubernetes, OpenTelemetry 및 결함 허용(Fault-tolerant) 오케스트레이션 패턴을 다룹니다.

목차
서론 및 동기 (Introduction & Motivation)
핵심 아키텍처 원칙 (Core Architecture Principles)
에이전트 서비스 설계 (Agent Service Design)

AgentRunner 루프 (The AgentRunner Loop)
에이전트 간 통신 (Inter-Agent Communication)
도구 레지스트리 서비스 (Tool Registry Service)
메모리 아키텍처 (Memory Architecture)
컨텍스트 윈도우 관리 (Context Window Management)
오케스트레이터 및 감독자 패턴 (Orchestrator & Supervisor Pattern)
보안 및 권한 부여 (Security & Authorization)
관측 가능성: 트레이스, 로그, 메트릭 (Observability: Traces, Logs, Metrics)
Kubernetes에서의 배포 (Deployment on Kubernetes)
확장 전략 (Scaling Strategies)
결함 허용 및 재시도 전략 (Fault Tolerance & Retry Strategies)
에이전트 마이크로서비스 테스트 (Testing Agent Microservices)
에이전트 서비스를 위한 CI/CD 파이프라인 (CI/CD Pipeline for Agent Services)
비용 관리 및 토큰 예산 책정 (Cost Management & Token Budgeting)
프로덕션 준비 체크리스트 (Production Readiness Checklist)
참조 아키텍처 다이어그램 (Reference Architecture Diagram)

서론 및 동기 (Introduction & Motivation)

왜 모놀리식(Monolithic) 에이전트 시스템은 프로덕션에서 실패하는가
추론(Reasoning), 도구 호출(Tool calls), 메모리 검색(Memory retrieval), 출력 생성(Output generation)을 처리하는 단일 프로세스 에이전트는 프로토타입에서는 잘 작동합니다. 하지만 프로덕션 환경에서는 다음과 같이 예측 가능한 방식으로 무너집니다:

지연 시간 결합 (Latency coupling) — 하나의 느린 도구 호출이 전체 추론 루프를 차단합니다.
확장 불가능한 컴퓨팅 (Unscalable compute) — 검색 워크로드와 요약 워크로드를 독립적으로 확장할 수 없습니다.
폭발 반경 (Blast radius) — 단일 LLM API 타임아웃이나 메모리 손상이 시스템 전체를 다운시킵니다.
제로 배포 세분성 (Zero deployment granularity) — 하나의 도구 통합을 업데이트하려면 모든 것을 다시 배포해야 합니다.
비용 청구를 위한 격리 부재 (No isolation for billing) — 개별 에이전트 기능에 컴퓨팅 비용을 할당하는 것이 불가능합니다.

마이크로서비스 솔루션
각 자율적 기능은 다음과 같은 특성을 가진 독립적으로 배포 가능하고 독립적으로 확장 가능한 서비스가 됩니다:

자체 API 접점 (HTTP/gRPC)
자체 상태 확인(Health checks) 및 준비 프로브(Readiness probes)
자체 메모리 범위 (공유된 인프로세스 상태 없음)
자체 도구 바인딩 (도구 레지스트리(Tool Registry)에서 런타임에 해결됨)
자체 관측 가능성 (분산 트레이스, 메트릭, 구조화된 로그)

마이크로 에이전트(Micro Agent)란 무엇인가?

마이크로 에이전트(Micro Agent)란 다음과 같은 특징을 가진 경계가 지정된 자율 서비스(bounded autonomous service)입니다:

API 호출을 통해 작업(프롬프트(prompt) + 컨텍스트(context) + 세션 ID(session ID))을 수락함
LLM 백엔드를 사용하여 '계획(plan) → 실행(act) → 관찰(observe)' 루프를 실행함
중앙 집중식 도구 레지스트리(Tool Registry)를 통해 도구(tools)를 호출함
외부 메모리 저장소(external memory store)에서 대화 상태를 저장하고 검색함
타입이 지정된 결과(typed result)를 반환하거나 다운스트림 소비자(downstream consumers)에게 이벤트를 방출함

핵심 통찰: 마이크로 에이전트는 단순한 "스마트 함수(smart function)"가 아닙니다. 이는 자체적인 API 계약(API contract), 메모리 범위(memory scope), 장애 모드(failure modes) 및 SLA를 가진 서비스입니다. 이에 따라 설계하십시오.

핵심 아키텍처 원칙

단일 책임 (Single Responsibility)
각 에이전트는 정확히 하나의 추론 도메인(reasoning domain)을 소유합니다.
예시: 상태가 없는 추론(Stateless Reasoning), 상태가 있는 메모리(Stateful Memory)

LLM 추론 단계는 반드시 상태가 없어야(stateless) 합니다. 메모리는 외부 저장소에 존재해야 합니다. 요청 간에 프로세스 내 RAM(in-process RAM)에 대화 기록이 머물러서는 안 됩니다.

스키마 우선 도구 계약 (Schema-First Tool Contracts)
모든 도구는 에이전트가 호출하기 전에 공유 도구 레지스트리(Tool Registry)에 게시된 JSON 스키마(JSON Schema) 정의를 가져야 합니다. 임의의(ad-hoc) 함수 시그니처는 허용되지 않습니다. 이를 통해 다음이 가능해집니다:

LLM 출력이 백엔드 서비스에 도달하기 전 런타임 입력 유효성 검사(Runtime input validation)
자동 생성된 문서
하위 호환성 검사가 포함된 도구 버전 관리

멱등적 동작 (Idempotent Actions)
외부 상태를 수정하는 모든 도구 호출(이메일 전송, DB 쓰기, 웹훅(webhook) 트리거 등)은 반드시 멱등적(idempotent)이어야 합니다.
전략:

HTTP 계층에서 멱등성 키(idempotency keys) 사용 (Idempotency-Key 헤더 전달)
큐(queue) 수준에서 메시지 중복 제거 사용 (Kafka의 exactly-once semantics)
재시도 시 안전하도록 도구 핸들러 설계: 확인 후 실행(check-then-act) 패턴 사용

기본적으로 비동기 방식 (Async by Default)
오래 걸리는 에이전트 작업(다단계 조사, 코드 생성 + 실행)은 긴 타임아웃을 가진 동기식 HTTP가 아니라 반드시 비동기 작업 큐(async task queues)를 사용해야 합니다.

Client ──► POST /tasks ──► Kafka/BullMQ ──► AgentWorker
Client ──► GET /tasks/{id} ──► Redis (상태 폴링) ◄── WebSocket/SSE 푸시 (선택 사항)

명시적 컨텍스트 경계 (Explicit Context Boundaries)
각 에이전트 호출은 경계가 지정된 컨텍스트 패킷(bounded context packet)을 전달하며, 메시지 기록이 무제한으로 커지게 해서는 안 됩니다. ContextManager 서비스는 주입(injection) 전에 기록을 압축하거나 요약합니다.

에이전트 서비스 설계 프로젝트 레이아웃 (Agent Service Design Project Layout)

각 에이전트는 다음과 같은 표준 구조를 가진 컨테이너화된 FastAPI 또는 gRPC 서비스입니다:

agent-search/
├── agent/
│ ├── core.py # AgentRunner: 계획(plan) → 실행(act) → 관찰(observe) 루프
│ ├── prompts.py # 시스템 프롬프트(System prompt) + 퓨샷 템플릿(few-shot templates)
│ ├── memory.py # ContextManager: 로드(load)/압축(compress)/저장(save)
│ ├── tools.py # 도구 바인딩(Tool bindings) (도구 레지스트리(Tool Registry) 호출)
│ └── schemas.py # 모든 I/O를 위한 Pydantic 모델
├── api/
│ ├── routes.py # POST /run, GET /status/{task_id}
│ ├── middleware.py # 인증(Auth), 속도 제한(rate limiting), 요청 추적(request tracing)
│ └── deps.py # 의존성 주입(Dependency injection): DB, Redis, LLM 클라이언트
├── tests/
│ ├── unit/
│ ├── integration/
│ └── fixtures/
├── Dockerfile
├── pyproject.toml
└── k8s/
├── deployment.yaml
├── service.yaml
├── hpa.yaml
└── configmap.yaml

API 규약 (API Contract)

모든 에이전트는 최소한 다음의 HTTP 엔드포인트(endpoints)를 노출합니다:

POST /run: 작업 제출 (동기(sync), 짧은 작업만 가능)
POST /tasks: 작업 제출 (비동기(async), task_id 반환)
GET /tasks/{task_id}: 작업 상태 및 결과 폴링(Poll)
GET /health: 활성 프로브(Liveness probe)
GET /ready: 준비 프로브(Readiness probe) (LLM + 메모리 저장소 확인)
GET /metrics: Prometheus 메트릭(metrics) 엔드포인트

agent/schemas.py

from pydantic import BaseModel, Field
from typing import Optional, Dict, Any
from enum import Enum

class TaskStatus(str, Enum):
PENDING = "pending"
RUNNING = "running"
COMPLETED = "completed"
FAILED = "failed"
CANCELLED = "cancelled"

class AgentTask(BaseModel):
id: str
session_id: str
prompt: str
metadata: Dict[str, Any] = Field(default_factory=dict)
max_steps: int = Field(default=10, ge=1, le=25)
token_budget: int = Field(default=8192, ge=512, le=32768)

class AgentResult(BaseModel):
task_id: str
status: TaskStatus
output: Optional[str] = None
steps_used: int = 0
tokens_used: int = 0
tool_calls: int = 0
error: Optional[str] = None
duration_ms: int = 0

AgentRunner 루프 전체 구현 (The AgentRunner Loop Full Implementation)

agent/core.py

import asyncio
import time
from opentelemetry import trace
from tenacity import retry,

stop_after_attempt, wait_exponential_jitter
tracer = trace.get_tracer(name)
MAX_STEPS = 15
class AgentRunner:
def init(self, agent_id: str, config: AgentConfig):
self.agent_id = agent_id
self.llm = LLMClient(model=config.model, timeout=30)
self.memory = ContextManager(agent_id, max_tokens=config.context_limit)
self.tools = ToolRegistryClient(config.tool_registry_url)
self.metrics = AgentMetrics(agent_id)

async def run(self, task: AgentTask) -> AgentResult:
    start = time.monotonic()
    with tracer.start_as_current_span("agent.run") as span:
        span.set_attribute("agent.id", self.agent_id)
        span.set_attribute("agent.task_id", task.id)
        span.set_attribute("agent.session", task.session_id)
    try:
        result = await self._run_loop(task, span)
    except TokenBudgetExceeded as e:
        result = AgentResult(task_id=task.id, status=TaskStatus.COMPLETED, output=e.partial_output, error="token_budget_exceeded")
    except Exception as e:
        span.record_exception(e)
        result = AgentResult(task_id=task.id, status=TaskStatus.FAILED, error=str(e))
    finally:
        result.duration_ms = int((time.monotonic() - start) * 1000)
        self.metrics.record(result)
    return result

async def _run_loop(self, task: AgentTask, span) -> AgentResult:
# Load available tools from registry
tool_schemas = await self.tools.fetch(agent_id=self.agent_id)
# Load and compress conversation history
context = await self.memory.load(task.session_id)
messages = build_messages(context, task.prompt)
total_tokens = 0
tool_call_count = 0
for step in range(task.max_steps):
span.set_attribute("agent.current_step", step)
with tracer.start_as_current_span("agent.llm_call") as llm_span:
response = await self._complete_with_retry(messages, tool_schemas)
llm_span.set_attribute("llm.prompt_tokens", response.usage.

prompt_tokens ) llm_span.set_attribute("llm.completion_tokens", response.usage.completion_tokens) total_tokens += response.usage.total_tokens if total_tokens > task.token_budget: raise TokenBudgetExceeded(partial_output=response.content, tokens_used=total_tokens) if response.finish_reason == "stop": await self.memory.save(task.session_id, messages + [response.message]) return AgentResult(task_id=task.id, status=TaskStatus.COMPLETED, output=response.content, steps_used=step+1, tokens_used=total_tokens, tool_calls=tool_call_count) if response.tool_calls: tool_call_count += len(response.tool_calls) results = await self._execute_tools(response.tool_calls) messages.append(response.message) messages.extend(tool_result_messages(results)) # 최대 스텝 도달 — 사용 가능한 최적의 출력 반환 return AgentResult(task_id=task.id, status=TaskStatus.COMPLETED, output=response.content, steps_used=task.max_steps, tokens_used=total_tokens, error="max_steps_reached") @retry(stop=stop_after_attempt(3), wait=wait_exponential_jitter(max=15)) async def _complete_with_retry(self, messages, tools): return await self.llm.complete(messages=messages, tools=tools) async def _execute_tools(self, tool_calls): tasks = [self.tools.invoke(tc) for tc in tool_calls] return await asyncio.gather(*tasks, return_exceptions=True)

Inter-Agent Communication Pattern Selection Matrix
gRPC Service Definition
For synchronous sub-agent calls, gRPC provides strong typing, bidirectional streaming, and efficient binary serialization.
// proto/agent_service.proto
syntax = "proto3"
package agents

v1 ; service AgentService { rpc RunTask ( TaskRequest ) returns ( TaskResponse ); rpc StreamSteps ( TaskRequest ) returns ( stream StepEvent ); rpc Health ( HealthRequest ) returns ( HealthResponse ); } message TaskRequest { string task_id = 1 ; string session_id = 2 ; string prompt = 3 ; map < string , string > metadata = 4 ; int32 max_steps = 5 ; int32 token_budget = 6 ; } message TaskResponse { string task_id = 1 ; string status = 2 ; string output = 3 ; int32 steps_used = 4 ; int32 tokens_used = 5 ; string error = 6 ; } message StepEvent { int32 step_number = 1 ; string type = 2 ; // "llm_call" | "tool_call" | "tool_result" string content = 3 ; } Kafka 이벤트 스키마: 에이전트 간 비동기 파이프라인 핸드오프를 위해 Schema Registry에 등록된 Avro 또는 JSON 스키마를 사용합니다. { "schema" : { "type" : "record" , "name" : "AgentTaskEvent" , "namespace" : "com.myco.agents.v1" , "fields" : [ { "name" : "task_id" , "type" : "string" }, { "name" : "source_agent" , "type" : "string" }, { "name" : "target_agent" , "type" : "string" }, { "name" : "session_id" , "type" : "string" }, { "name" : "prompt" , "type" : "string" }, { "name" : "context" , "type" : { "type" : "map" , "values" : "string" }}, { "name" : "created_at" , "type" : { "type" : "long" , "logicalType" : "timestamp-millis" }} ] } } Kafka Producer (Orchestrator 내) # Orchestrator에서 agent-search로 디스패치할 때 aiokafka에서 가져옵니다. from aiokafka import AIOKafkaProducer import json async def dispatch_to_agent ( target_agent : str , task : AgentTask ): producer = AIOKafkaProducer ( bootstrap_servers = KAFKA_BROKERS ) await producer . start () try : event = { " task_id " : task . id , " source_agent " : " orchestrator " , " target_agent " : target_agent , " session_id " : task . session_id , " prompt " : task . prompt , " created_at " : int ( time . time () * 1000 ) } await producer . send_and_wait ( topic = f " agent.tasks. { target_agent } " , value = json . dumps ( event ). encode (), key = task . session_id .

encode(), # 세션별로 파티셔닝(partition by session)
headers = [("trace-id", get_current_trace_id().encode())])
finally:
await producer.stop()

도구 레지스트리 서비스 아키텍처 (Tool Registry Service Architecture)

도구 레지스트리(Tool Registry)는 도구 정의를 저장, 검증 및 제공하는 중앙 집중식 FastAPI 서비스입니다. 이는 모든 에이전트→도구(agent→tool) 트래픽을 위한 타입 지정 API 게이트웨이(typed API gateway) 역할을 합니다.

도구 등록 스키마 (Tool Registration Schema)

도구가 시작 시 스스로를 등록함

class ToolDefinition(BaseModel):
name: str
version: str
description: str
parameters: Dict[str, Any] # JSON Schema 반환
endpoint: str # 레지스트리가 호출을 라우팅할 위치
health_url: str
auth_type: str # "api_key" | "oauth2" | "none"
rate_limit: int # 에이전트당 분당 호출 횟수
timeout_ms: int = 10000 # 밀리초 단위 타임아웃

도구 서비스 시작 시 등록 호출

@app.on_event("startup")
async def register_tool():
registry = ToolRegistryClient(TOOL_REGISTRY_URL)
await registry.register(ToolDefinition(name="web_search", version="2.1.0"

프로덕션 등급의 마이크로서비스로서 마이크로 에이전트(Micro Agents) 구축하기

요약