Merged quantized KV cache fixes into the DeepSeek V4 branch

Check it out: https://github.com/fairydreaming/llama.cpp/tree/dsv4
These are PRs #25247, #25303 (mine) and #25202 (from am17an). I left out some padding changes from the last PR that I think are unnecessary, so let me know if you run into any crashes.
Also, some perplexity values:
f16:
$ ./bin/llama-perplexity -m ~/ggufs/DeepSeek-V4-Flash.gguf -f ../../perplexity/wikitext-2-raw/wiki.test.raw -c 8192 -b 8192 -ub 8192 -cmoe -fit off -fa 1
0.00.474.417 W llama_model_loader: tensor overrides to CPU are used with mmap enabled - consider using --no-mmap for better performance
0.10.392.053 I
0.10.392.174 I system_info: n_threads = 32 (n_threads_batch = 32) / 64 | CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.10.392.189 I perplexity: tokenizing the input ..
0.10.924.462 I perplexity: tokenization took 532.264 ms
0.10.924.610 I perplexity: calculating perplexity over 35 chunks, n_ctx=8192, batch_size=8192, n_seq=1
0.22.458.574 I perplexity: 11.53 seconds per pass - ETA 6.72 minutes
[1]2.8897,[2]2.7710,[3]3.1873,[4]3.6052,[5]3.4648,[6]3.5705,[7]3.7952,[8]3.6431,[9]3.5904,[10]3.5542,[11]3.5701,[12]3.6851,[13]3.7128,[14]3.6751,[15]3.7551,[16]3.7644,[17]3.7564,[18]3.8208,[19]3.8337,[20]3.8398,[21]3.8507,[22]3.8847,[23]3.9882,[24]4.0528,[25]3.9720,[26]3.9313,[27]3.9123,[28]3.9423,[29]3.9668,[30]3.9640,[31]3.9817,[32]3.9912,[33]3.9735,[34]4.0053,[35]4.0242,
6.22.639.632 I Final estimate: PPL = 4.0242 +/- 0.02400
Q8_0:
$ ./bin/llama-perplexity -m ~/ggufs/DeepSeek-V4-Flash.gguf -f ../../perplexity/wikitext-2-raw/wiki.test.raw -c 8192 -b 8192 -ub 8192 -cmoe -fit off -fa 1 --cache-type-k q8_0 --cache-type-v q8_0
0.00.485.802 W llama_model_loader: tensor overrides to CPU are used with mmap enabled - consider using --no-mmap for better performance
0.10.435.253 I
0.10.435.377 I system_info: n_threads = 32 (n_threads_batch = 32) / 64 | CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.10.435.393 I perplexity: tokenizing the input ..
0.10.961.804 I perplexity: tokenization took ..
0.10.961.950 I perplexity: calculating perplexity over 35 chunks, n_ctx=8192, batch_size=8192, n_seq=1
0.22.521.970 I perplexity: 11.56 seconds per pass - ETA 6.73 minutes
[1]2.8842,[2]2.7793,[3]3.1950,[4]3.6124,[5]3.4653,[6]3.5701,[7]3.8000,[8]3.6448,[9]3.5878,[10]3.5534,[11]3.5690,[12]3.6869,[13]3.7161,[14]3.6800,[15]3.7580,[16]3.7656,[17]3.7574,[18]3.8241,[19]3.8383,[20]3.8468,[21]3.8580,[22]3.8934,[23]3.9956,[24]4.0581,[25]3.9765,[26]3.9371,[27]3.9186,[28]3.9494,[29]3.9749,[30]3.9716,[31]3.9896,[32]3.9993,[33]3.9832,[34]4.0122,[35]4.0304,
6.26.279.848 I Final estimate: PPL = 4.0304 +/- 0.02407
Q4_0:
$ ./bin/llama-perplexity -m ~/ggufs/DeepSeek-V4-Flash.gguf -f ../../perplexity/wikitext-2-raw/wiki.test.raw -c 8192 -b 8192 -ub 8192 -cmoe -fit off -fa 1 --cache-type-k q4_0 --cache-type-v q4_0
0.00.435.984 W llama_model_loader: tensor overrides to CPU are used with mmap enabled - consider using --no-mmap for better performance
0.10.360.658 I
0.10.360.777 I system_info: n_threads = 32 (n_threads_batch = 32) / 64 | CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.10.360.794 I perplexity: tokenizing the input ..
0.10.886.143 I perplexity: tokenization took 525.34 ms
0.10.886.291 I perplexity: calculating perplexity over 35 chunks, n_ctx=8192, batch_size=8192, n_seq=1
0.22.520.679 I perplexity: 11.63 seconds per pass - ETA 6.78 minutes
[1]3.0059,[2]2.8369,[3]3.2596,[4]3.6650,[5]3.5126,[6]3.6189,[7]3.8468,[8]3.6861,[9]3.6260,[10]3.5867,[11]3.5995,[12]3.7178,[13]3.7424,[14]3.7061,[15]3.7874,[16]3.7935,[17]3.7830,[18]3.8481,[19]3.8604,[20]3.8667,[21]3.8754,[22]3.9084,[23]4.0125,[24]4.0766,[25]3.9975,[26]3.9580,[27]3.9393,[28]3.9692,[29]3.9949,[30]3.9923,[31]4.0101,[32]4.0198,[33]4.0038,[34]4.0337,[35]4.0512,
6.28.034.177 I Final estimate: PPL = 4.0512 +/- 0.02420
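To put the three final estimates side by side, here is a minimal Python sketch that computes how far each quantized cache type lands from the f16 baseline. The numbers are copied from the runs above; the helper name `compare_to_baseline` and the "within the f16 error" check are my own illustration, not part of llama.cpp, and the check is only a rough sanity test rather than a proper significance test.

```python
# Final PPL estimates and their reported +/- values, copied from the runs above.
results = {
    "f16": (4.0242, 0.02400),
    "q8_0": (4.0304, 0.02407),
    "q4_0": (4.0512, 0.02420),
}


def compare_to_baseline(results, baseline="f16"):
    """Return the absolute and relative PPL change of each entry vs. the baseline.

    Hypothetical helper for illustration only.
    """
    base_ppl, base_err = results[baseline]
    rows = {}
    for name, (ppl, _err) in results.items():
        delta = ppl - base_ppl
        rows[name] = {
            "delta": delta,
            "percent": 100.0 * delta / base_ppl,
            # Rough check: is the change smaller than the baseline's reported +/- value?
            "within_baseline_error": abs(delta) <= base_err,
        }
    return rows


if __name__ == "__main__":
    for name, row in compare_to_baseline(results).items():
        print(
            f"{name:5s} delta={row['delta']:+.4f} "
            f"({row['percent']:+.3f}%) "
            f"within f16 error: {row['within_baseline_error']}"
        )
```

By this measure Q8_0 is about 0.15% above f16 and inside the reported +/- 0.02400, while Q4_0 is about 0.67% above and just outside it.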
submitted by /u/fairydreaming