Decoding Attention: An Optimization Library for LLM Inference Speed

Summary

Bruce-Lee-LY/decoding_attention is a C++ library that optimizes attention variants such as MHA, MQA, GQA, and MLA on CUDA cores, specifically for the decoding stage of LLM inference. It targets NVIDIA GPUs, and its repository topics reference the related projects Flash-Attention, FlashInfer, and FlashMLA.

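Key points

Provides optimized MHA, MQA, GQA, and MLA attention for the decoding stage of LLM inference
Targets high performance using CUDA cores
Related to existing acceleration projects such as Flash-Attention, FlashInfer, and FlashMLA, which the repository lists among its topics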
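
To make the decoding-stage setting concrete, here is a minimal CPU reference sketch of attention for a single newly generated token against a key/value cache. It is not the library's API or its CUDA kernel: the function name decode_attention_ref, the parameter names, and the tensor layouts are all assumptions chosen for illustration. It covers MHA (num_kv_heads equals num_q_heads), MQA (num_kv_heads is 1), and GQA (anything in between) through one head-mapping step; MLA is not covered.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical reference implementation (not the library's API).
// Assumed row-major layouts for one sequence and one new token:
//   q:       [num_q_heads, head_dim]
//   k_cache: [seq_len, num_kv_heads, head_dim]
//   v_cache: [seq_len, num_kv_heads, head_dim]
// Returns out: [num_q_heads, head_dim]
std::vector<float> decode_attention_ref(const std::vector<float>& q,
                                        const std::vector<float>& k_cache,
                                        const std::vector<float>& v_cache,
                                        std::size_t seq_len,
                                        std::size_t num_q_heads,
                                        std::size_t num_kv_heads,
                                        std::size_t head_dim) {
    assert(num_kv_heads > 0 && num_q_heads % num_kv_heads == 0);
    assert(seq_len > 0 && head_dim > 0);
    assert(q.size() == num_q_heads * head_dim);
    assert(k_cache.size() == seq_len * num_kv_heads * head_dim);
    assert(v_cache.size() == k_cache.size());

    // Number of query heads that share one KV head:
    // 1 for MHA, num_q_heads for MQA, in between for GQA.
    const std::size_t group = num_q_heads / num_kv_heads;
    const float scale = 1.0f / std::sqrt(static_cast<float>(head_dim));

    std::vector<float> out(num_q_heads * head_dim, 0.0f);
    std::vector<float> scores(seq_len);

    for (std::size_t h = 0; h < num_q_heads; ++h) {
        const std::size_t kv_h = h / group;  // KV head serving this query head
        const float* qh = &q[h * head_dim];

        // Scaled dot product of the single query with every cached key.
        float max_score = std::numeric_limits<float>::lowest();
        for (std::size_t t = 0; t < seq_len; ++t) {
            const float* kt = &k_cache[(t * num_kv_heads + kv_h) * head_dim];
            float dot = 0.0f;
            for (std::size_t d = 0; d < head_dim; ++d) dot += qh[d] * kt[d];
            scores[t] = dot * scale;
            max_score = std::max(max_score, scores[t]);
        }

        // Numerically stable softmax over the cached positions.
        float denom = 0.0f;
        for (std::size_t t = 0; t < seq_len; ++t) {
            scores[t] = std::exp(scores[t] - max_score);
            denom += scores[t];
        }

        // Weighted sum of cached values.
        for (std::size_t t = 0; t < seq_len; ++t) {
            const float w = scores[t] / denom;
            const float* vt = &v_cache[(t * num_kv_heads + kv_h) * head_dim];
            for (std::size_t d = 0; d < head_dim; ++d) out[h * head_dim + d] += w * vt[d];
        }
    }
    return out;
}
```

Because the query has length one, each head reduces to a matrix-vector product over the cache followed by a softmax and a weighted sum. A reference like this is mainly useful for checking the output of an optimized GPU kernel on small inputs.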

Decoding Attention

Repository: Bruce-Lee-LY/decoding_attention
Language: C++
Stars: 46
Forks: 4
Topics: cuda, cuda-core, decoding-attention, flash-attention, flashinfer, flashmla, gpu, gqa, inference, large-language-model, llm, mha, mla, mqa, multi-head-attention, nvidia

Description:
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.

AI-generated content
