NVIDIA/TensorRT-LLM: A Framework for Optimizing LLM Inference on GPUs

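Summary

TensorRT-LLM provides a Python API for defining large language models (LLMs) and running inference on them efficiently on NVIDIA GPUs. It supports state-of-the-art optimizations and includes components for building Python and C++ runtimes that orchestrate inference execution with high performance.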

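Key Points

TensorRT-LLM offers an easy-to-use Python API for LLM inference on NVIDIA GPUs.
It supports state-of-the-art optimizations so that inference runs efficiently.
It includes components for building Python and C++ runtimes that orchestrate inference execution.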
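
As a rough illustration of the Python API mentioned above, the sketch below follows the `LLM` / `SamplingParams` pattern from the project's quick-start material. The helper names `extract_texts` and `run_demo`, the sampling values, and the stand-in objects are assumptions made for this example and do not come from the summary. `run_demo` is only defined, not called, since it needs `tensorrt_llm` installed and an NVIDIA GPU; the model name is left to the caller.

```python
from types import SimpleNamespace


def extract_texts(outputs):
    """Return the first generated completion for each request output."""
    return [out.outputs[0].text for out in outputs]


def run_demo(model_name, prompts):
    """Generate completions; requires tensorrt_llm and an NVIDIA GPU."""
    # Imported lazily so this file still loads on machines without the library.
    from tensorrt_llm import LLM, SamplingParams

    llm = LLM(model=model_name)  # model_name is supplied by the caller
    params = SamplingParams(temperature=0.8, top_p=0.95)  # illustrative values
    return extract_texts(llm.generate(prompts, params))


# Stand-in objects mimicking the assumed output shape, so the helper
# can be tried without a GPU.
fake_outputs = [
    SimpleNamespace(outputs=[SimpleNamespace(text="Paris")]),
    SimpleNamespace(outputs=[SimpleNamespace(text="4")]),
]
print(extract_texts(fake_outputs))
```

On a machine with the library installed, something like `run_demo(<model name or path>, ["Hello, my name is"])` would be the actual entry point; the stand-in objects only exercise the output handling.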

NVIDIA/TensorRT-LLM

Repository: NVIDIA/TensorRT-LLM
Language: Python
Stars: 13470
Forks: 2319
Topics: blackwell, cuda, llm-serving, moe, pytorch

Description:
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.

AI-generated content
