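DFlash confirmed working on 2x4060 - slower than Tensor+MTP
Summary
A benchmark of DFlash against MTP (Multi-Token Prediction) on a 2x4060 setup found that, for now, the MTP+Tensor combination is faster than DFlash. The post also shares a warning about the DFlash quantized models on Hugging Face and the steps for building the drafter yourself.
Key points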
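- DFlash with layer split is slower than MTP with tensor split: 101.71 t/s eval speed at best versus 116.47 t/s
- DFlash draft acceptance falls sharply as draft n-max grows: 0.746 at 2, 0.612 at 4, 0.346 at 8
- The DFlash quantized models on Hugging Face did not work for the author at the time of the post
- Converting the drafter yourself takes about 2 minutes and would have saved the author about 4 hours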
After a few hours of debugging, I finally got dflash + qwen working!!
But I found that the mtp+tensor setup I'm currently running is still faster :/
First note: Tensor+DFlash isn't supported yet, so the benchmark below compares tensor+mtp vs layer+dflash.
| Config | Split Mode | Spec Type | Draft n-max | Eval Speed (t/s) | Prompt Speed (t/s) | Total Tokens | Draft Acceptance | Mean Draft Len | Notes |
|---|---|---|---|---|---|---|---|---|---|
| DFlash | Layer | draft-dflash | 2 | 89.36 | 1,868 | 11,130 | 0.746 | 2.49 | - |
| DFlash | Layer | draft-dflash | 4 | 101.71 | 1,766 | 11,340 | 0.612 | 3.45 | Best DFlash |
| DFlash | Layer | draft-dflash | 8 | 84.89 | 1,857 | 11,203 | 0.346 | 3.77 | Acceptance drops hard |
| MTP | Layer | draft-mtp | 2 | 82.93 | 3,020 | 11,759 | 0.652 | 2.30 | - |
| MTP + Tensor | Tensor | draft-mtp | 2 | 116.47 | 2,501 | 13,736 | 0.767 | 2.53 | MVP (~115-125) |
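A side note on the table above: in every row the reported mean draft length matches 1 + n-max × acceptance. The Python sketch below checks that against the table values. It assumes each step drafts exactly n-max tokens and the target model adds one token per step; that model is inferred from the numbers, not taken from the llama.cpp source, and the helper names are made up for illustration.

```python
# Rows copied from the benchmark table:
# (config, draft n-max, draft acceptance, reported mean draft len, eval speed in t/s)
ROWS = [
    ("DFlash (layer)", 2, 0.746, 2.49, 89.36),
    ("DFlash (layer)", 4, 0.612, 3.45, 101.71),
    ("DFlash (layer)", 8, 0.346, 3.77, 84.89),
    ("MTP (layer)", 2, 0.652, 2.30, 82.93),
    ("MTP + Tensor", 2, 0.767, 2.53, 116.47),
]


def expected_mean_len(n_max: int, acceptance: float) -> float:
    """Assumed model: n_max drafted tokens per step, plus one token from the target."""
    return 1.0 + n_max * acceptance


def rejected_per_step(n_max: int, acceptance: float) -> float:
    """Drafted tokens per step that get thrown away under the same assumption."""
    return n_max * (1.0 - acceptance)


for name, n_max, acc, reported, tps in ROWS:
    est = expected_mean_len(n_max, acc)
    wasted = rejected_per_step(n_max, acc)
    print(f"{name:15s} n_max={n_max} est_len={est:.2f} reported={reported:.2f} "
          f"rejected/step={wasted:.2f} eval={tps} t/s")
```

Under that assumption, going from n-max 4 to 8 raises the mean length only from about 3.45 to 3.77, while rejected drafts per step go from about 1.55 to 5.23. That fits the drop in eval speed at n-max 8.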
Before you waste time trying one of the dflash quants on hf, be warned.. they are horribly broken, and building your own takes 2 minutes. That would have saved me 4 hours..
# Create the working directories
mkdir -p /data/drafters/Qwen3.6-35B-A3B-DFlash
mkdir -p /data/drafters/Qwen3.6-35B-A3B-target-meta
mkdir -p /data/llama_presets/dflash

# Download the DFlash drafter
hf download z-lab/Qwen3.6-35B-A3B-DFlash \
  --local-dir /data/drafters/Qwen3.6-35B-A3B-DFlash

# Download only the config and tokenizer files of the target model
hf download Qwen/Qwen3.6-35B-A3B \
  config.json \
  tokenizer.json \
  tokenizer_config.json \
  generation_config.json \
  --local-dir /data/drafters/Qwen3.6-35B-A3B-target-meta

# Convert the drafter to a bf16 GGUF
python convert_hf_to_gguf.py \
  /data/drafters/Qwen3.6-35B-A3B-DFlash \
  --target-model-dir /data/drafters/Qwen3.6-35B-A3B-target-meta \
  --outtype bf16 \
  --outfile /data/llama_presets/dflash/Qwen3.6-35B-A3B-DFlash-bf16.gguf

# Configure the build
cmake -B build \
  -DGGML_NATIVE=ON \
  -DLLAMA_BUILD_TESTS=OFF \
  .
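Since the conversion step reads the target model's metadata from a separate directory, a quick check that the download actually landed can save a failed run. This is a minimal Python sketch: the file list is just the four files downloaded above (whether the converter needs all of them is an assumption), and `missing_target_files` is a hypothetical helper, not part of llama.cpp.

```python
from pathlib import Path

# The four target-model files downloaded in the commands above.
REQUIRED_TARGET_FILES = (
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "generation_config.json",
)


def missing_target_files(target_meta_dir) -> list[str]:
    """Return the names from REQUIRED_TARGET_FILES that are not present as files."""
    root = Path(target_meta_dir)
    return [name for name in REQUIRED_TARGET_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Path taken from the commands above; adjust to your own layout.
    missing = missing_target_files("/data/drafters/Qwen3.6-35B-A3B-target-meta")
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all target metadata files present")
```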
Full benchmarks:
[AA.qwen36-35b-a3b-dflash-q4xl-layer-2gpu]
hf-repo = unsloth/Qwen3.6-35B-A3B-MTP-GGUF
hf-file = Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
spec-draft-model = /presets/dflash/Qwen3.6-35B-A3B-DFlash-Q8_0.gguf
split-mode = layer
tensor-split = 1,1
ctx-size = 125000
spec-draft-n-max = 2
spec-type = draft-dflash

[36259] 0.09.944.818 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 6007, progress = 0.70, t = 3.61 s / 1666.18 tokens per second
[36259] 0.10.652.662 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 8055, progress = 0.93, t = 4.31 s / 1867.57 tokens per second
[36259] 0.13.971.626 I slot print_timing: id 0 | task 0 | n_decoded = 266, tg = 88.17 t/s, tg_3s = 88.17 t/s
[36259] 0.16.975.275 I slot print_timing: id 0 | task 0 | n_decoded = 538, tg = 89.36 t/s, tg_3s = 90.56 t/s
...
[36259] 0.35.085.404 I slot print_timing: id 0 | task 0 | n_decoded = 2192, tg = 90.84 t/s, tg_3s = 93.71 t/s
[36259] 0.38.086.512 I slot print_timing: id 0 | task 0 | n_decoded = 2435, tg = 89.75 t/s, tg_3s = 80.97 t/s
[36259] 0.38.997.732 I slot print_timing: id 0 | task 0 | prompt eval time = 4615.02 ms / 8624 tokens ( 0.54 ms per token, 1868.68 tokens per second)
[36259] 0.38.997.736 I slot print_timing: id 0 | task 0 | eval time = 28042.92 ms / 2506 tokens ( 11.19 ms per token, 89.36 tokens per second)
[36259] 0.38.997.737 I slot print_timing: id 0 | task 0 | total time = 32657.94 ms / 11130 tokens
[36259] 0.38.997.740 I slot print_timing: id 0 | task 0 | graphs reused = 996
[36259] 0.38.997.743 I slot print_timing: id 0 | task 0 | draft acceptance = 0.74602 ( 1501 accepted / 2012 generated), mean len = 2.49

[AA.qwen36-35b-a3b-dflash-q4xl-layer-2gpu] ...
split-mode = layer
spec-draft-n-max = 4
spec-type = draft-dflash

[41155] 0.09.370.448 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 4096, progress = 0.47, t = 3.03 s / 1351.34 tokens per second
[41155] 0.10.103.966 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 6007, progress = 0.70, t = 3.76 s / 1595.66 tokens per second
[41155] 0.10.898.156 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 8055, progress = 0.93, t = 4.56 s / 1766.92 tokens per second
[41155] 0.14.245.688 I slot print_timing: id 0 | task 0 | n_decoded = 246, tg = 81.39 t/s, tg_3s = 81.39 t/s
[41155] 0.17.250.295 I slot print_timing: id 0 | task 0 | n_decoded = 522, tg = 86.61 t/s, tg_3s = 91.86 t/s
...
[41155] 0.32.360.984 I slot print_timing: id 0 | task 0 | n_decoded = 2162, tg = 102.28 t/s, tg_3s = 124.22 t/s
[41155] 0.35.364.050 I slot print_timing: id 0 | task 0 | n_decoded = 2476, tg = 102.57 t/s, tg_3s = 104.56 t/s
[41155] 0.37.926.245 I slot print_timing: id 0 | task 0 | prompt eval time = 4883.73 ms / 8624 tokens ( 0.57 ms per token, 1765.86 tokens per second)
[41155] 0.37.926.249 I slot print_timing: id 0 | task 0 | eval time = 26702.98 ms / 2716 tokens ( 9.83 ms per token, 101.71 tokens per second)
[41155] 0.37.926.249 I slot print_timing: id 0 | task 0 | total time = 31586.71 ms / 11340 tokens
[41155] 0.37.926.254 I slot print_timing: id 0 | task 0 | graphs reused = 776
[41155] 0.37.926.257 I slot print_timing: id 0 | task 0 | draft acceptance = 0.61245 ( 1928 accepted / 3148 generated), mean len = 3.45

[AA.qwen36-35b-a3b-dflash-q4xl-layer-2gpu]...
split-mode = layer
spec-draft-n-max = 8
spec-type = draft-dflash

[54233] 0.09.981.921 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 6007, progress = 0.70, t = 3.61 s / 1661.90 tokens per second
[54233] 0.10.706.690 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 8055, progress = 0.93, t = 4.34 s / 1856.29 tokens per second
[54233] 0.14.051.754 I slot print_timing: id 0 | task 0 | n_decoded = 232, tg = 76.32 t/s, tg_3s = 76.32 t/s
[54233] 0.17.074.437 I slot print_timing: id 0 | task 0 | n_decoded = 484, tg = 79.83 t/s, tg_3s = 83.37 t/s
..
[54233] 0.38.199.247 I slot print_timing: id 0 | task 0 | n_decoded = 2408, tg = 88.57 t/s, tg_3s = 95.74 t/s
[54233] 0.41.216.508 I slot print_timing: id 0 | task 0 | n_decoded = 2567, tg = 84.99 t/s, tg_3s = 52.70 t/s
[54233] 0.41.393.109 I slot print_timing: id 0 | task 0 | prompt eval time = 4644.35 ms / 8624 tokens ( 0.54 ms per token, 1856.88 tokens per second)
[54233] 0.41.393.112 I slot print_timing: id 0 | task 0 | eval time = 30381.21 ms / 2579 tokens ( 11.78 ms per token, 84.89 tokens per second)
[54233] 0.41.393.113 I slot print_timing: id 0 | task 0 | total time = 35025.56 ms / 11203 tokens
[54233] 0.41.393.117 I slot print_timing: id 0 | task 0 | graphs reused = 674
[54233] 0.41.393.121 I slot print_timing: id 0 | task 0 | draft acceptance = 0.34649 ( 1896 accepted / 5472 generated), mean len = 3.77

# Comparison with MTP
[BB.qwen36-35b-a3b-mtp-q4xl-layer-2gpu]
hf-repo = unsloth/Qwen3.6-35B-A3B-MTP-GGUF
hf-file = Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
split-mode = layer
tensor-split = 1,1
ctx-size = 125000
spec-type = draft-mtp
spec-draft-n-max = 2

.43.865.897 I slot print_timing: id 0 | task 1089 | prompt processing, n_tokens = 10240, progress = 0.97, t = 3.29 s / 3112.18 tokens per second
[58465] 0.47.086.333 I slot print_timing: id 0 | task 1089 | n_decoded = 250, tg = 83.31 t/s, tg_3s = 83.30 t/s
[58465] 0.50.108.848 I slot print_timing: id 0 | task 1089 | n_decoded = 497, tg = 82.51 t/s, tg_3s = 81.72 t/s
[58465] 0.53.110.559 I slot print_timing: id 0 | task 1089 | n_decoded = 734, tg = 81.33 t/s, tg_3s = 78.95 t/s
[58465] 0.56.114.649 I slot print_timing: id 0 | task 1089 | n_decoded = 982, tg = 81.63 t/s, tg_3s = 82.55 t/s
[58465] 0.58.073.921 I slot print_timing: id 0 | task 1089 | prompt eval time = 3509.66 ms / 10599 tokens ( 0.33 ms per token, 3019.95 tokens per second)
[58465] 0.58.073.931 I slot print_timing: id 0 | task 1089 | eval time = 13988.52 ms / 1160 tokens ( 12.06 ms per token, 82.93 tokens per second)
[58465] 0.58.073.932 I slot print_timing: id 0 | task 1089 | total time = 17498.18 ms / 11759 tokens
[58465] 0.58.073.933 I slot print_timing: id 0 | task 1089 | graphs reused = 1570
[58465] 0.58.073.935 I slot print_timing: id 0 | task 1089 | draft acceptance = 0.65209 ( 656 accepted / 1006 generated), mean len = 2.30

# MTP+Tensor MVP
[AA.qwen36-35b-a3b-mtp-q4xl-tensor-2gpu]
hf-repo = unsloth/Qwen3.6-35B-A3B-MTP-GGUF
hf-file = Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
split-mode = tensor
tensor-split = 1,1
ctx-size = 125000
spec-type = draft-mtp
spec-draft-n-max = 2

[34815] 0.45.342.497 I slot print_timing: id 0 | task 1233 | prompt processing, n_tokens = 8192, progress = 0.75, t = 3.23 s / 2532.77 tokens per second
[34815] 0.46.162.930 I slot print_timing: id 0 | task 1233 | prompt processing, n_tokens = 10240, progress = 0.94, t = 4.05 s / 2525.38 tokens per second
[34815] 0.49.464.775 I slot print_timing: id 0 | task 1233 | n_decoded = 337, tg = 111.84 t/s, tg_3s = 111.84 t/s
[34815] 0.52.466.053 I slot print_timing: id 0 | task 1233 | n_decoded = 640, tg = 106.41 t/s, tg_3s = 100.96 t/s
..
[34815] 1.07.531.394 I slot print_timing: id 0 | task 1233 | n_decoded = 2398, tg = 113.76 t/s, tg_3s = 124.82 t/s
[34815] 1.10.552.860 I slot print_timing: id 0 | task 1233 | n_decoded = 2795, tg = 115.97 t/s, tg_3s = 131.39 t/s
[34815] 1.11.119.277 I slot print_timing: id 0 | task 1233 | prompt eval time = 4343.34 ms / 10863 tokens ( 0.40 ms per token, 2501.07 tokens per second)
[34815] 1.11.119.286 I slot print_timing: id 0 | task 1233 | eval time = 24667.71 ms / 2873 tokens ( 8.59 ms per token, 116.47 tokens per second)
[34815] 1.11.119.287 I slot print_timing: id 0 | task 1233 | total time = 29011.06 ms / 13736 tokens
[34815] 1.11.119.288 I slot print_timing: id 0 | task 1233 | graphs reused = 2336
[34815] 1.11.119.291 I slot print_timing: id 0 | task 1233 | draft acceptance = 0.76743 ( 1739 accepted / 2266 generated), mean len = 2.53
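The summary lines at the end of each run are enough to rebuild the table. Here is a rough Python sketch that pulls them out with regular expressions and recomputes the rates from the raw counts. The patterns are written only against the log lines pasted in this post, so the format may differ in other llama.cpp builds, and `parse_summary` is a hypothetical helper.

```python
import re

# Three summary lines copied from the DFlash n-max = 2 run in this post.
SAMPLE = """\
[36259] 0.38.997.732 I slot print_timing: id 0 | task 0 | prompt eval time = 4615.02 ms / 8624 tokens ( 0.54 ms per token, 1868.68 tokens per second)
[36259] 0.38.997.736 I slot print_timing: id 0 | task 0 | eval time = 28042.92 ms / 2506 tokens ( 11.19 ms per token, 89.36 tokens per second)
[36259] 0.38.997.743 I slot print_timing: id 0 | task 0 | draft acceptance = 0.74602 ( 1501 accepted / 2012 generated), mean len = 2.49
"""

# Matches "| prompt eval time = X ms / N tokens" and "| eval time = X ms / N tokens".
TIMING_RE = re.compile(r"\|\s*(prompt eval|eval) time =\s*([\d.]+) ms /\s*(\d+) tokens")
# Matches the draft acceptance summary line.
ACCEPT_RE = re.compile(
    r"draft acceptance =\s*[\d.]+ \(\s*(\d+) accepted /\s*(\d+) generated\), mean len =\s*([\d.]+)"
)


def parse_summary(log_text: str) -> dict:
    """Extract token counts, tokens/second and draft stats from print_timing lines."""
    out = {}
    for line in log_text.splitlines():
        m = TIMING_RE.search(line)
        if m:
            key = "prompt" if m.group(1) == "prompt eval" else "eval"
            ms, tokens = float(m.group(2)), int(m.group(3))
            out[f"{key}_tokens"] = tokens
            out[f"{key}_tps"] = tokens / (ms / 1000.0)  # recomputed, not read from the log
            continue
        m = ACCEPT_RE.search(line)
        if m:
            accepted, generated = int(m.group(1)), int(m.group(2))
            out["accepted"] = accepted
            out["generated"] = generated
            out["acceptance"] = accepted / generated
            out["mean_len"] = float(m.group(3))
    return out


summary = parse_summary(SAMPLE)
print({k: round(v, 3) if isinstance(v, float) else v for k, v in summary.items()})
```

Recomputing tokens per second from the millisecond and token counts doubles as a sanity check on the figures the log prints itself.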
I'm looking forward to tensor+dflash, but for now it's still completely broken.
[BB.qwen36-35b-a3b-dflash-q4xl-tensor-2gpu]
[38761] /app/ggml/src/ggml-backend-meta.cpp:730: GGML_ASSERT(split_states_equal(src_ss[0], src_ss[2])) failed
[38761] /app/libggml-base.so.0(+0x1b1f6)[0x7f3bf508f1f6]
[38761] /app/libggml-base.so.0(ggml_print_backtrace+0x21a)[0x7f3bf508f67a]
[38761] /app/libggml-base.so.0(ggml_abort+0x15b)[0x7f3bf508f85b]
[38761] /app/libggml-base.so.0(+0x46478)[0x7f3bf50ba478]
[38761] /app/libggml-base.so.0(+0x3e085)[0x7f3bf50b2085]
[38761] /app/libggml-base.so.0(+0x47a19)[0x7f3bf50bba19]
[38761] /app/libggml-base.so.0(+0x4a320)[0x7f3bf50be320]
[38761] /app/libggml-base.so.0(ggml_gallocr_alloc_graph+0x493)[0x7f3bf50a5b03]
[38761] /app/libggml-base.so.0(ggml_backend_sched_alloc_graph+0x18f)[0x7f3bf50ac1df]
[38761] /app/libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0xeb)[0x7f3bf523342b]
[38761] /app/libllama.so.0(_ZN13llama_context6decodeERK11llama_batch+0x378)[0x7f3bf5239238]
Submitted by /u/Chuyito
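AI-generated content
This content was automatically summarized, translated, and analyzed by AI from an original post on r/LocalLLaMA. Copyright remains with the original author; please check the original post for the exact details.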