llama.cpp · Headline · 2026. 05. 22. 07:23

server : free draft/MTP resources on sleep to fix VRAM leak ( #23461 )

Summary

Fixes a VRAM leak caused by destroy() in server_context_impl not releasing the speculative decoder and MTP-related resources. An explicit reset step on entering sleep prevents out-of-memory (OOM) errors and use-after-free problems.

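Key points

Adds release logic for the spec, ctx_dft, and model_dft resources
Fixes a VRAM leak that occurred across repeated sleep/resume cycles
Prevents server crashes caused by out-of-memory (OOM) errors
Prevents use-after-free by guaranteeing a proper cleanup order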

server : free draft/MTP resources on sleep to fix VRAM leak ( #23461 )

The destroy() function in server_context_impl only cleaned up the main model and context (via llama_init.reset()); it did not free the speculative decoder (spec), the draft context (ctx_dft), or the draft model (model_dft). For MTP (Multi-Token Prediction) models, ctx_dft holds GPU-allocated resources (KV cache, compute buffers) that were not released on entering the sleep state. On each sleep/resume cycle, new resources were allocated while the old ones were still held, producing a VRAM leak that eventually crashed the server with an out-of-memory error. The fix explicitly resets spec, ctx_dft, and model_dft in destroy() before resetting llama_init, which guarantees a cleanup order that avoids use-after-free. ref: #23395
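The cleanup order described above can be illustrated with a small, self-contained C++ sketch. It is not the actual llama.cpp code: mock_resource, mock_server_context, load(), and run_cycles() are hypothetical stand-ins, and only the member names (spec, ctx_dft, model_dft, llama_init) are taken from the commit message. The relative order of the three draft-side resets is assumed to follow the order in which the message lists them; the message only states that all three happen before llama_init is reset.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Bookkeeping for the sketch: how many mock resources are alive,
// and the order in which they were released.
static int g_live = 0;
static std::vector<std::string> g_release_log;

// Hypothetical stand-in for a GPU-backed object (model, context, decoder state).
struct mock_resource {
    std::string name;
    explicit mock_resource(std::string n) : name(std::move(n)) { ++g_live; }
    ~mock_resource() { --g_live; g_release_log.push_back(name); }
};

// Hypothetical stand-in for server_context_impl; not the real class layout.
struct mock_server_context {
    std::unique_ptr<mock_resource> llama_init; // main model + context
    std::unique_ptr<mock_resource> model_dft;  // draft model
    std::unique_ptr<mock_resource> ctx_dft;    // draft context (KV cache, compute buffers)
    std::unique_ptr<mock_resource> spec;       // speculative decoder

    // Called on startup and on every resume in this sketch.
    void load() {
        llama_init = std::make_unique<mock_resource>("llama_init");
        model_dft  = std::make_unique<mock_resource>("model_dft");
        ctx_dft    = std::make_unique<mock_resource>("ctx_dft");
        spec       = std::make_unique<mock_resource>("spec");
    }

    // Called on sleep: release the draft-side objects first, the main model last.
    void destroy() {
        spec.reset();
        ctx_dft.reset();
        model_dft.reset();
        llama_init.reset();
        // Postcondition: nothing is left allocated after sleep.
        assert(!spec && !ctx_dft && !model_dft && !llama_init);
    }
};

// Runs n sleep/resume cycles and returns the number of resources still alive.
int run_cycles(int n) {
    mock_server_context server;
    for (int i = 0; i < n; ++i) {
        server.load();    // resume
        server.destroy(); // sleep
    }
    return g_live;
}
```

Calling run_cycles(3) in this sketch leaves no live resources, whereas a destroy() that reset only llama_init would leave the three draft-side objects allocated while the server sleeps, which is the situation the fix addresses for the real GPU buffers.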

Assisted-by: llama.cpp:local pi

macOS/iOS:
macOS Apple Silicon (arm64)
macOS Apple Silicon (arm64, KleidiAI enabled)
macOS Intel (x64)
iOS XCFramework

Linux:
Ubuntu x64 (CPU)
Ubuntu arm64 (CPU)
Ubuntu s390x (CPU)
Ubuntu x64 (Vulkan)
Ubuntu arm64 (Vulkan)
Ubuntu x64 (ROCm 7.2)
Ubuntu x64 (OpenVINO)
Ubuntu x64 (SYCL FP32)
Ubuntu x64 (SYCL FP16)

Android:
Android arm64 (CPU)

Windows:
Windows x64 (CPU)
Windows arm64 (CPU)
Windows x64 (CUDA 12) - CUDA 12.4 DLLs
Windows x64 (CUDA 13) - CUDA 13.1 DLLs
Windows x64 (Vulkan)
Windows x64 (SYCL)
Windows x64 (HIP)

openEuler:
openEuler x86 (310p)
openEuler x86 (910b, ACL Graph)
openEuler aarch64 (310p)
openEuler aarch64 (910b, ACL Graph)

AI-generated content
