cuda: enable topk-moe fusion for 288 experts ([#25267](https://github.com/gg


The topk-moe fusion previously accepted only power-of-two expert counts (plus the special case 576), so models with 288 experts (e.g. Step-3.7-Flash) fell back to the unfused per-layer routing chain: softmax/sigmoid, argsort, get_rows, sum_rows, div, clamp, scale. At batch size 1 this adds roughly 330 extra tiny graph nodes per token.
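For orientation, the unfused chain named above can be sketched on the CPU as a single routine. This is a minimal illustration, not ggml code: the function name `route_unfused`, the softmax-only gating, the clamp lower bound, and applying the clamp to the sum before the division are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical reference routine (not ggml code). Each commented step
// corresponds to one node of the unfused routing chain.
std::vector<std::pair<int, float>> route_unfused(const std::vector<float>& logits,
                                                 int top_k, float scale) {
    const int n_experts = static_cast<int>(logits.size());
    assert(top_k > 0 && top_k <= n_experts);

    // softmax: turn router logits into probabilities (max-subtracted for stability)
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(n_experts);
    float denom = 0.0f;
    for (int i = 0; i < n_experts; ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        denom += probs[i];
    }
    for (float& p : probs) {
        p /= denom;
    }

    // argsort: expert ids ordered by descending probability, truncated to top_k
    std::vector<int> ids(n_experts);
    std::iota(ids.begin(), ids.end(), 0);
    std::stable_sort(ids.begin(), ids.end(),
                     [&probs](int a, int b) { return probs[a] > probs[b]; });
    ids.resize(top_k);

    // get_rows: gather the probabilities of the selected experts
    std::vector<float> weights(top_k);
    for (int i = 0; i < top_k; ++i) {
        weights[i] = probs[ids[i]];
    }

    // sum_rows: total probability mass of the selected experts
    float sum = std::accumulate(weights.begin(), weights.end(), 0.0f);

    // clamp: keep the divisor away from zero (the lower bound is an assumed value)
    sum = std::max(sum, 1e-6f);

    // div + scale: renormalize the selected weights, then apply the routing scale
    std::vector<std::pair<int, float>> routed(top_k);
    for (int i = 0; i < top_k; ++i) {
        routed[i] = {ids[i], weights[i] / sum * scale};
    }
    return routed;
}
```

Calling `route_unfused(logits, 8, 1.0f)` with 288 router logits returns eight (expert id, weight) pairs whose weights sum to 1. In the unfused graph each of these steps is a separate node per MoE layer, which is why the per-token node count grows so quickly.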

288 is a multiple of the warp size, so the existing kernel already handles it. This change adds the missing template instantiation and accepts 288 in the eligibility check.
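The two pieces described here, an eligibility check and a per-count template instantiation, can be sketched as follows. Everything in this block is hypothetical: the names `topk_moe_eligible`, `experts_per_lane`, and `dispatch_experts_per_lane` do not come from topk-moe.cu, the dispatch lists only a few sizes for brevity, and a 32-lane warp is assumed.

```cpp
#include <cassert>

// Assumption: 32-lane warps. 288 = 9 * 32, so it divides evenly.
constexpr int WARP_SIZE = 32;

constexpr bool is_power_of_two(int n) {
    return n > 0 && (n & (n - 1)) == 0;
}

// Hypothetical eligibility check: power-of-two counts plus the two special cases.
constexpr bool topk_moe_eligible(int n_experts) {
    return is_power_of_two(n_experts) || n_experts == 576 || n_experts == 288;
}

// Stand-in for a kernel templated on the expert count. The compile-time count
// fixes how many experts each lane of a warp handles.
template <int N_EXPERTS>
constexpr int experts_per_lane() {
    static_assert(N_EXPERTS % WARP_SIZE == 0, "expert count must be a multiple of the warp size");
    return N_EXPERTS / WARP_SIZE;
}

// Runtime count -> compile-time instantiation. A count with no case has no
// compiled kernel, so the caller would have to take the unfused path (-1 here).
int dispatch_experts_per_lane(int n_experts) {
    switch (n_experts) {
        case 128: return experts_per_lane<128>();
        case 256: return experts_per_lane<256>();
        case 288: return experts_per_lane<288>();  // the kind of case this change adds
        case 512: return experts_per_lane<512>();
        case 576: return experts_per_lane<576>();
        default:  return -1;
    }
}
```

Under these assumptions `dispatch_experts_per_lane(288)` yields 9 experts per lane. The sketch also shows why both halves of the change are needed: accepting 288 in the check without a matching instantiation would leave no kernel to launch, and the reverse would leave the instantiation unreachable.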

Measured on gfx1151 with Step-3.7-Flash IQ4_XS (llama-bench, -b 4096 -ub 4096 -fa 1 -dio 1 -ctk q8_0 -ctv q8_0; machine idle, before/after pairs matched on pp4096 to control for load):

test            | before         | after
----------------+----------------+----------------------------
pp4096          | 460.99 ± 0.45  | 462.47 ± 0.34 (unchanged)
tg128           | 19.10 ± 0.04   | 19.56 ± 0.03 (+2.4%)
tg128 @ d30000  | 12.68 ± 0.04   | 12.69 ± 0.03 (unchanged)
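The percentages in the table can be reproduced with a few lines of arithmetic. The helper names below are made up for this example, and the per-token conversion assumes the figures are tokens per second, which is the unit llama-bench reports.

```cpp
#include <cassert>

// Relative change between two throughput figures, in percent.
double relative_change_percent(double before, double after) {
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

// Per-token latency in milliseconds, assuming the input is tokens per second.
double ms_per_token(double tokens_per_second) {
    assert(tokens_per_second > 0.0);
    return 1000.0 / tokens_per_second;
}

// Time saved per generated token between two runs, in milliseconds.
double saved_ms_per_token(double before_tps, double after_tps) {
    return ms_per_token(before_tps) - ms_per_token(after_tps);
}
```

For the tg128 row, `relative_change_percent(19.10, 19.56)` gives about 2.4, and `saved_ms_per_token(19.10, 19.56)` gives about 1.2 ms per token. The pp4096 and tg128 @ d30000 rows come out at roughly 0.3% and 0.1%, which is why they are marked as unchanged.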

Prompt processing is unaffected (the fusion only applies to decode routing). The decode gain is about +2.4% at shallow context and fades as the context grows: by 30k tokens each step is attention-bound on the KV cache, so removing the fixed routing overhead is no longer visible.

Assisted-By: Claude Fable 5 noreply@anthropic.com

Update tests/test-backend-ops.cpp

Co-authored-by: Oliver Simons osimons@nvidia.com

Add comment for the 288 case in topk-moe.cu

Co-authored-by: Oliver Simons osimons@nvidia.com

macOS/iOS:

macOS Apple Silicon (arm64)
macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
macOS Intel (x64)
iOS XCFramework

Android:

Android arm64 (CPU)

openEuler:

DISABLED
openEuler x86 (310p)
openEuler x86 (910b, ACL Graph)
openEuler aarch64 (310p)
openEuler aarch64 (910b, ACL Graph)
