Reddit summary, 2026-05-08 07:28

Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM

Summary

This post shares results from running Qwen3.6 27B with NVFP4 quantization and MTP (Multi-Token Prediction) in vLLM on a single RTX 5090, validated up to a 200k-token context. The author reports a 10-run stability pass at 200k depth in which all runs completed, with generation typically around 65-75 tok/s. With prefix caching, time to first token (TTFT) at 200k dropped from 68.8s cold to 2.8s warm, which matters for agent workflows that keep reusing a large prefix.

Key points

Shows a practical way to run Qwen3.6 27B on a single RTX 5090.
Combines NVFP4 quantization, vLLM, and FlashInfer to handle a long context.
Generation at 200k depth is about 65-75 tok/s across a 10-run stability pass.
Prefix caching sharply cuts TTFT, which suits agent workflows that reuse a long prefix.

So I spent some time testing Qwen3.6 27B NVFP4 on my RTX 5090 and wanted to share the numbers, since most of the recent good posts are either around 48GB cards, FP8, or llama.cpp/GGUF.

This is not a "best possible setup" claim. More like: this is what I got working, here are the exact params, here are the numbers, and maybe it helps other 5090 owners avoid some guessing.

The short version:

Single RTX 5090, 32GB VRAM
Model: Peutlefaire/Qwen3.6-27B-NVFP4
vLLM: 0.20.1.dev0+g88d34c640.d20260502
Torch: 2.13.0.dev20260430+cu130
Driver: 595.58.03
Quantization: compressed-tensors
Attention backend: flashinfer
KV cache: fp8_e4m3
MTP enabled with 3 speculative tokens
Text-only mode
Public claim I am comfortable with: 200k context, not 220k/262k

The vLLM model endpoint reports max_model_len: 230400, but I only benchmarked up to 200k context depth. I am intentionally keeping the claim at 200k because that is what I actually validated with repeated runs.

Here are the main vLLM args:

vllm serve Peutlefaire/Qwen3.6-27B-NVFP4 \
  --host 0.0.0.0 --port 8082 \
  --safetensors-load-strategy=prefetch \
...

Startup log had the important bits I wanted to see:

Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM
Available KV cache memory: 8.3 GiB
Maximum concurrency for 230,400 tokens per request: 1.00x

After the run, nvidia-smi showed about 30478 MiB / 32607 MiB used, with the vLLM EngineCore process using around 29998 MiB.
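As a rough cross-check of the startup log and nvidia-smi figures, here is a minimal Python sketch. The per-token figure is only an upper bound under the assumption that the whole 8.3 GiB pool is spread evenly over one full-length request; it says nothing about how vLLM actually divides that pool for this model. The 88.0% value it compares against is the peak KV cache usage from the MTP telemetry section of this post.

```python
GIB = 1024 ** 3
KIB = 1024

kv_cache_bytes = 8.3 * GIB   # "Available KV cache memory: 8.3 GiB"
max_model_len = 230_400      # reported by the vLLM model endpoint

# Assumption: one request fills the whole window and the pool is shared
# evenly per token. This is a budget, not a measured per-token cost.
budget_per_token_kib = kv_cache_bytes / max_model_len / KIB

# Tokens held at the end of a 200k benchmark run: depth + pp + tg.
tokens_in_flight = 200_000 + 2_048 + 480
expected_kv_usage = tokens_in_flight / max_model_len

# VRAM headroom from the nvidia-smi numbers (MiB).
vram_used_fraction = 30_478 / 32_607

print(f"cache budget per token: {budget_per_token_kib:.1f} KiB")
print(f"expected KV usage at 200k: {expected_kv_usage:.1%} (log peak: 88.0%)")
print(f"VRAM in use: {vram_used_fraction:.1%}")
```

If the expected usage lands close to the logged peak, that suggests the 200k runs really did fill most of the cache, with little room left for a second concurrent request.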

llama-benchy numbers

All of this was with:

llama-benchy 0.3.7
--pp 2048
--tg 480
--latency-mode generation
--skip-coherence
concurrency 1
War and Peace text as the long-context source

Context ladder

context depth	prefill tok/s	generation tok/s	TTFT
0	28470	86.3	0.2s
...
Then I ran a separate 10-run stability pass at 200k, with `--exit-on-first-fail`, just to make sure it was not a lucky single run.

200k stability run

pp=2048, tg=480, depth=200000, runs=10, no cache:

10/10 runs completed
exit status 0
mean prefill: 2883 tok/s
mean generation: 73.6 tok/s
generation stddev: 13.5 tok/s
mean TTFT: 70.2s
wall time: 12:48.79

Per-run generation speed:

73.04, 75.12, 63.24, 75.94, 59.02, 110.71, 64.11, 68.18, 72.55, 74.37 tok/s

So I would not cherry-pick the 93 tok/s 200k result from the smaller sweep. The more honest number for this setup is probably around 65-75 tok/s generation at 200k, depending on the run.
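To see how much the single 110.71 tok/s run affects the summary numbers, the per-run speeds can be fed through Python's `statistics` module. This is just a sketch over the figures quoted above. The wall-time estimate at the end rests on the assumption that each run costs roughly the mean TTFT plus 480 generated tokens at that run's speed.

```python
import statistics

# Per-run generation speeds (tok/s) from the 10-run stability pass.
speeds = [73.04, 75.12, 63.24, 75.94, 59.02, 110.71, 64.11, 68.18, 72.55, 74.37]

mean = statistics.fmean(speeds)
pstdev = statistics.pstdev(speeds)   # population standard deviation
median = statistics.median(speeds)

# Drop the single fastest run to see how much it drives the spread.
trimmed = sorted(speeds)[:-1]
trimmed_mean = statistics.fmean(trimmed)
trimmed_pstdev = statistics.pstdev(trimmed)

# Assumption: wall time ~= runs * mean TTFT + per-run generation time.
est_wall_s = len(speeds) * 70.2 + sum(480 / s for s in speeds)
reported_wall_s = 12 * 60 + 48.79    # "wall time: 12:48.79"

print(f"mean={mean:.1f} pstdev={pstdev:.1f} median={median:.1f}")
print(f"without fastest run: mean={trimmed_mean:.1f} pstdev={trimmed_pstdev:.1f}")
print(f"estimated wall={est_wall_s:.0f}s reported={reported_wall_s:.0f}s")
```

The population standard deviation reproduces the reported 13.5 tok/s, and most of that spread comes from the one fast run. The median sits inside the 65-75 tok/s range, which is why that range is a fairer headline than any single run.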

Prefix cache behavior

I also tested prefix caching separately. At 200k:

run	prefill tok/s	generation tok/s	TTFT
cold	2911	65.2	68.8s
warm	761	59.6	2.8s

The warm-cache prefill number is not directly comparable to cold prefill, but the TTFT drop is the useful part. For local coding / agent workflows where you keep reusing a huge prefix, this is the thing that actually feels different.
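To put the TTFT drop in session terms, here is a small Python sketch using the two TTFT values from the table. The 20-turn session is hypothetical, and it assumes every turn after the first hits the cached 200k prefix, which will not hold if the prefix changes or gets evicted.

```python
# TTFT figures from the 200k prefix-cache table.
cold_ttft_s = 68.8
warm_ttft_s = 2.8

speedup = cold_ttft_s / warm_ttft_s

# Hypothetical session: 20 turns that all reuse the same 200k prefix.
# The first turn pays the cold cost; later turns are assumed to be warm.
turns = 20
no_cache_wait_s = turns * cold_ttft_s
cached_wait_s = cold_ttft_s + (turns - 1) * warm_ttft_s

print(f"TTFT speedup: {speedup:.1f}x")
print(f"time waiting for first token over {turns} turns: "
      f"{no_cache_wait_s:.0f}s uncached vs {cached_wait_s:.0f}s cached")
```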

MTP telemetry

From the vLLM log across the benchmark run:

Mean MTP acceptance length: 2.28
Average draft acceptance: 42.7%
Max observed GPU KV cache usage: 88.0%

The acceptance rate moved around a lot, so I am curious if other people get better numbers with num_speculative_tokens=2 instead of 3. I started with 3 because it was stable here, but I am not claiming it is optimal.
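The two acceptance figures in the log are related, which a short Python sketch makes explicit. It assumes that each verification step emits the accepted draft tokens plus one token from the target model, so the mean acceptance length is one more than the accepted drafts per step.

```python
num_speculative_tokens = 3       # MTP draft tokens proposed per step
mean_acceptance_length = 2.28    # from the vLLM log

# Assumption: acceptance length = accepted draft tokens + 1 target token.
accepted_drafts_per_step = mean_acceptance_length - 1
draft_acceptance = accepted_drafts_per_step / num_speculative_tokens

print(f"accepted drafts per step: {accepted_drafts_per_step:.2f}")
print(f"implied draft acceptance: {draft_acceptance:.1%}")
```

Under that assumption the implied rate agrees with the logged 42.7%, so on average a bit more than one of the three drafted tokens is kept per step. That is the number to compare when trying num_speculative_tokens=2.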

Caveats

A few things worth saying clearly:

I did not run accuracy benchmarks. This is performance/stability only.
vLLM warns that NVFP4 global scales can reduce accuracy, so if you care about coding quality, run your own evals.
Prefix caching in Mamba cache align mode is still marked experimental in vLLM.
FlashInfer + spec decode forced the CUDAGraph mode to piecewise.
I did not test vision/multimodal. This is text-only.
I did not validate 220k or 262k. 200k is the number this run can support.

I am pretty happy with this local 5090 setup now. It is not perfect, and I am not claiming it replaces every cloud model, but for long local coding sessions it feels like the card is doing what I bought it for.

If anyone else is running Qwen3.6 27B on a 5090, especially with vLLM and NVFP4 or FP8, I would appreciate comparing params and MTP settings. I am also curious about a cleaner max_num_batched_tokens setting for MTP, because vLLM warns that 4096 may be suboptimal.

I saved the raw llama-benchy JSON/stdout/stderr and the full vLLM log locally. If people want the full audit trail, I can upload it somewhere.
