Reddit요약2026. 05. 05. 14:32

16GB VRAM, 100k 컨텍스트 길이를 갖춘 Qwen3.6-27B 양자화 실험

요약

본 기사는 16GB VRAM 환경의 노트북에서 Qwen3.6-27B와 같은 대규모 언어 모델(LLM)을 효율적으로 실행하는 방법을 실험하고 가이드를 제공합니다. 특히, Unsloth imatrix를 사용하여 IQ4_XS GGUF 형식으로 양자화된 모델을 생성하고, `buun-llama-cpp` 포크 버전을 활용하여 최적의 성능을 확인했습니다. 사용자는 이 과정을 통해 대용량 컨텍스트(100k)와 높은 효율성을 요구하는 LLM 추론 환경을 구축할 수 있습니다.

핵심 포인트

Qwen3.6-27B 같은 대형 모델을 16GB VRAM 노트북에서 실행하기 위해 IQ4_XS GGUF 양자화 방식을 사용했습니다.
`buun-llama-cpp` 포크가 기존의 `llama-cpp-turboquant`보다 더 나은 성능과 안정성을 제공함을 확인했습니다.
모델을 로컬 서버로 구동할 때는 `llama-server`를 사용하며, 대용량 컨텍스트(100k)와 높은 GPU 활용률(`ngl 999`) 설정을 적용해야 합니다.
OpenCode 환경에서 LLM을 통합하려면, OpenAI 호환 API 엔드포인트로 설정하고 모델의 최대 컨텍스트 길이 및 기능을 명시적으로 정의해야 합니다.

A5000 16GB GPU 를 가진 노트북에서 Qwen3.6-27B 를 실행하는 방법을 실험했습니다. Unsloth imatrix 를 사용하여我自己的 IQ4_XS GGUF "qwen3.6-27b-IQ4_XS-pure.gguf"를 생성하고 다른 양자화 모델들과 평균 KLD (Kullback-Leibler Divergence) 를 비교했습니다.

buun-llama-cpp fork 가 TheTom/llama-cpp-turboquant fork 보다 더 좋다는 것을 확인했습니다. 다양한 turboquant 버전을 테스트했습니다.

내 버전을 시도하려면 다음을 수행하세요:

Huggingface 에서 my GGUF 를 다운로드합니다. 이미 이것 을 기반으로 개선된 채팅 템플릿을 포함합니다.
https://github.com/spiritbuun/buun-llama-cpp 에서 buun-llama-cpp 를 클론합니다.
빌드합니다. Windows 에서 다음과 같이 사용했습니다:cmake -B build -G Ninja -DGGML_CUDA=ON -DCMAKE_C_COMPILER=clang-cl -DCMAKE_CXX_COMPILER=clang-cl cmake --build build --config Release -j 16
GPU VRAM 이 모두 자유로운지 nvidia-smi로 확인합니다.
다음과 같이 실행합니다:build/bin/llama-server --model qwen3.6-27b-IQ4_XS-pure.gguf --alias qwen3.6-27b -np 1 -ctk turbo3_tcq -ctv turbo3_tcq -c 100000 --fit off -ngl 999 --no-mmap -fa on --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0
OpenCode 에서 사용할 경우, 다음 ~/.config/opencode/opencode.json 파일을 사용합니다:

{
 "$schema": "https://opencode.ai/config.json",
 "plugin": [
  "opencode-anthropic-auth@latest",
  "opencode-copilot-auth@latest"
 ],
 "share": "disabled",
 "provider": {
  "llama.cpp": {
   "npm": "@ai-sdk/openai-compatible",
   "name": "llama.cpp (OpenAI Compatible)",
   "options": {
    "baseURL": "http://127.0.0.1:8080/v1",
    "apiKey": "1234"
   },
   "models": {
    "qwen3.5-27b": {
     "name": "Qwen 3.5 27B",
     "interleaved": {
      "field": "reasoning_content"
     },
     "limit": {
      "context": 100000,
      "output": 32000
     },
     "temperature": true,
     "reasoning": true,
     "attachment": false,
     "tool_call": true,
     "modalities": {
      "input": [
       "text"
      ],
      "output": [
       "text"
      ]
     },
     "cost": {
      "input": 0,
      "output": 0,
      "cache_read": 0,
      "cache_write": 0
     }
    }
   }
  }
 },
 "agent": {
  "code-reviewer": {
   "description": "Reviews code for best practices and potential issues",
   "model": "llama.cpp/qwen3.5-27b",
   "prompt": "You are a code reviewer. Focus on security, understandability, conciseness, maintainability and performance."
  },
  "plan": {
   "model": "llama.cpp/qwen3.5-27b"
  }
 },
 "model": "llama.cpp/qwen3.5-27b",
 "small_model": "llama.cpp/qwen3.5-27b"
}

"opencode-anthropic-auth@latest",
     "opencode-copilot-auth@latest"
     ],
     "share": "disabled",
     "provider": {
     "llama.cpp": {
       "npm": "@ai-sdk/openai-compatible",
       "name": "llama.cpp (OpenAI Compatible)",
       "options": {
         "baseURL": "http://127.0.0.1:8080/v1",
         "apiKey": "1234"
       },
       "models": {
         "qwen3.5-27b": {
           "name": "Qwen 3.5 27B",
           "interleaved": {
             "field": "reasoning_content"
           },
           "limit": {
             "context": 100000,
             "output": 32000
           },
           "temperature": true,
           "reasoning": true,
           "attachment": false,
           "tool_call": true,
           "modalities": {
             "input": [
               "text"
             ],
             "output": [
               "text"
             ]
           },
           "cost": {
             "input": 0,
             "output": 0,
             "cache_read": 0,
             "cache_write": 0
           }
         }
       }
     }
   },
   "agent": {
     "code-reviewer": {
       "description": "Reviews code for best practices and potential issues",
       "model": "llama.cpp/qwen3.5-27b",
       "prompt": "You are a code reviewer. Focus on security, understandability, conciseness, maintainability and performance."
     },
     "plan": {
       "model": "llama.cpp/qwen3.5-27b"
     }
   },
   "model": "llama.cpp/qwen3.5-27b",
   "small_model": "llama.cpp/qwen3.5-27b"
}

I get around 21 tokens/s generation speed/ 550 tokens/s prompt processing in the beginning, later it goes down to around 14 tokens/s (485 tokens/s prompt processing) at 15k context.

AI 자동 생성 콘텐츠

원문 바로가기

16GB VRAM, 100k 컨텍스트 길이를 갖춘 Qwen3.6-27B 양자화 실험

요약

핵심 포인트

댓글