HuggingFace Headlines, 2026. 05. 07. 07:47

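Exploring Quantization Backends in Hugging Face Diffusers

Summary

This technical article takes an in-depth look at how the quantization backends available in the Hugging Face Diffusers library (bitsandbytes, GGUF, torchao, Quanto, FP8, and others) improve the memory efficiency of large diffusion models. It notes the large memory footprint of loading a powerful model such as Flux-dev in BF16 precision; the core goal is to cut that footprint substantially through quantization while retaining high output quality. It presents concrete ways to quantize the Transformer and Text Encoder components to 4-bit or 8-bit with each backend and compares the resulting memory usage and inference time. The result is practical guidance for developers who want to deploy recent AI models on constrained hardware.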

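Key points

The quantization backends in Hugging Face Diffusers can substantially improve the memory efficiency of large diffusion models.
Quantization yields significant memory savings relative to BF16 precision, with 4-bit and 8-bit quantization being especially useful.
The article shows how to quantize the main components (Transformer, Text Encoder) individually and provides concrete code examples using libraries such as bitsandbytes.
The quality of quantized models can be checked through visual comparison and benchmarks; the aim is to find the balance between memory savings and preserved quality.
Beyond bitsandbytes, it explores GGUF, torchao, Quanto, and native FP8 support.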

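Before diving into the technical details of how the various quantization backends in Hugging Face Diffusers work, why not test your own perception?

We set up an environment where you can provide a prompt and generate results with both the original high-precision model (e.g., Flux-dev in BF16) and several quantized versions (BnB 4-bit, BnB 8-bit). The generated images are presented to you, and your challenge is to identify which ones came from the quantized models.

Try it out here or below!

Often, especially with 8-bit quantization, the differences are subtle and may not be noticeable without close inspection. More aggressive quantization (4-bit or lower) can be more noticeable, but the results are still very good, especially considering the large memory savings. NF4 often gives the best trade-off.

Now, let's dive deeper.

Building on our previous post, "Memory-efficient Diffusion Transformers with Quanto and Diffusers", this post explores the quantization backends integrated directly into Hugging Face Diffusers. We'll examine how bitsandbytes, GGUF, torchao, Quanto, and native FP8 support make large and powerful models more accessible, and demonstrate their use with Flux.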

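Before diving into the quantization backends, let's introduce the FluxPipeline (using the black-forest-labs/FLUX.1-dev checkpoint) and the components we'll be quantizing. Loading the full FLUX.1-dev model in BF16 precision requires approximately 31.447 GB of memory. The main components are: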

Text Encoders (CLIP and T5):

Function: Process input text prompts. FLUX-dev uses CLIP for initial understanding and a larger T5 for nuanced comprehension and better text rendering.
Memory: T5 - 9.52 GB; CLIP - 246 MB (BF16)

Transformer (Main Model - MMDiT):

Function: Core generative part (Multimodal Diffusion Transformer). Generates images in latent space from text embeddings.
Memory: 23.8 GB (BF16)

Variational Auto-Encoder (VAE):

Function: Translates images between pixel space and latent space. Decodes the generated latent representation into a pixel-based image.
Memory: 168 MB (BF16)

Focus of quantization: The examples primarily focus on the transformer and text_encoder_2 (T5), where the memory savings are largest.
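As a rough back-of-the-envelope check on why the transformer and T5 are the targets, the sketch below scales the BF16 sizes listed above by the ratio of bit widths. It assumes weights dominate memory and ignores quantization metadata (scales, zero points), layers that stay unquantized, and runtime overhead, so measured numbers will be higher. The function names are illustrative and not part of any library.

```python
# BF16 component sizes in GB, copied from the component list above.
BF16_SIZES_GB = {
    "text_encoder_2 (T5)": 9.52,
    "text_encoder (CLIP)": 0.246,
    "transformer": 23.8,
    "vae": 0.168,
}


def estimate_size_gb(bf16_gb, bits):
    """Scale a BF16 (16-bit) weight footprint to another bit width."""
    return bf16_gb * bits / 16


def estimate_pipeline_gb(bits, quantized=("transformer", "text_encoder_2 (T5)")):
    """Total footprint when only the named components are quantized."""
    total = 0.0
    for name, size in BF16_SIZES_GB.items():
        # Components not listed in `quantized` keep their BF16 size.
        total += estimate_size_gb(size, bits) if name in quantized else size
    return total


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit transformer + T5: ~{estimate_pipeline_gb(bits):.2f} GB")
```

CLIP and the VAE together account for well under 1 GB, so quantizing them would change the total very little. The listed component sizes sum to slightly more than the ~31.447 GB quoted for the loaded pipeline; the source does not explain the difference, so treat these figures as estimates only.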

prompts = [
"Baroque style, a lavish palace interior with ornate gilded ceilings, intricate tapestries, and dramatic lighting over a grand staircase.",
"Futurist style, a dynamic spaceport with sleek silver starships docked at angular platforms, surrounded by distant planets and glowing energy lines.",
...

bitsandbytes is a popular, user-friendly library for 8-bit and 4-bit quantization, widely used for LLMs and QLoRA fine-tuning. We can use it for transformer-based diffusion and flow models, too.
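To make the 8-bit versus 4-bit trade-off concrete, here is a minimal sketch of symmetric absmax quantization in plain Python. This is a simplified stand-in chosen for illustration, not the algorithm bitsandbytes actually uses (its 8-bit and NF4 schemes are blockwise and more elaborate). All function names are hypothetical.

```python
def quantize_absmax(weights, bits):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    # Fall back to a scale of 1.0 when every weight is zero.
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return [c * scale for c in codes]


def max_abs_error(weights, bits):
    """Largest round-trip error introduced by quantizing to `bits` bits."""
    codes, scale = quantize_absmax(weights, bits)
    restored = dequantize(codes, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


if __name__ == "__main__":
    weights = [0.12, -0.5, 0.33, 0.04]
    for bits in (8, 4):
        print(f"{bits}-bit max round-trip error: {max_abs_error(weights, bits):.5f}")
```

With fewer bits there are fewer representable levels, so the round-trip error grows. That is the intuition behind 8-bit outputs being hard to tell apart from BF16 while 4-bit differences can be more visible.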

Figure: Visual comparison of Flux-dev model outputs (left: BF16, center: BnB 4-bit, right: BnB 8-bit quantization).

Precision	Memory after loading	Peak memory	Inference time
BF16	~31.447 GB	36.166 GB	12 seconds
...

All benchmarks were performed on 1x NVIDIA H100 80GB GPU.

Example (Flux-dev with BnB 4-bit):

import torch
from diffusers import FluxPipeline
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
...

Note: When using PipelineQuantizationConfig with bitsandbytes, you need to import DiffusersBitsAndBytesConfig from diffusers and TransformersBitsAndBytesConfig from transformers separately. This is because these components originate from different libraries. If you prefer a simpler setup without managing these distinct imports, you can use an alternative approach for pipeline-level quantization; an example of this method is in the Diffusers documentation on pipeline-level quantization.
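The two-import pattern described in the note can be sketched as follows. This is an illustration under stated assumptions rather than the article's original listing: the NF4 settings and the wrapper function `load_flux_bnb_4bit` are choices made here, and running it requires a CUDA GPU, the bitsandbytes package, and access to the checkpoint.

```python
def load_flux_bnb_4bit(model_id="black-forest-labs/FLUX.1-dev"):
    """Load Flux-dev with the transformer and T5 encoder quantized to 4-bit."""
    # Imports are local so that defining this function needs no GPU stack.
    import torch
    from diffusers import FluxPipeline
    from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
    from diffusers.quantizers import PipelineQuantizationConfig
    from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

    # Assumed settings for this sketch: NF4 weights, BF16 compute.
    bnb_kwargs = dict(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    quant_config = PipelineQuantizationConfig(
        quant_mapping={
            # The transformer is a diffusers model, so it takes the diffusers config.
            "transformer": DiffusersBitsAndBytesConfig(**bnb_kwargs),
            # T5 is a transformers model, so it takes the transformers config.
            "text_encoder_2": TransformersBitsAndBytesConfig(**bnb_kwargs),
        }
    )
    pipe = FluxPipeline.from_pretrained(
        model_id,
        quantization_config=quant_config,
        torch_dtype=torch.bfloat16,
    )
    return pipe.to("cuda")
```

The mapping keys are the pipeline's component names, which is why each entry must use the config class from the library that owns that component.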

For more information check out the bitsandbytes docs.

torchao is a PyTorch-native library for architecture optimization, offering quantization, sparsity, and custom data types, designed for compatibility with torch.compile and FSDP. Diffusers supports a wide range of torchao's exotic data types, enabling fine-grained control over model optimization.

Figure: Visual comparison of Flux-dev model outputs using torchao int4_weight_only (left), int8_weight_only (center), and float8_weight_only (right) quantization.

torchao Precision	Memory after loading	Peak memory	Inference time
int4_weight_only	10.635 GB	14.654 GB	109 seconds
...

Example (Flux-dev with torchao INT8 weight-only):

@@
- from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
+ from diffusers import TorchAoConfig as DiffusersTorchAoConfig
...

Example (Flux-dev with torchao INT4 weight-only):

@@
- from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
+ from diffusers import TorchAoConfig as DiffusersTorchAoConfig
...

For more information check out the torchao docs.

Quanto is a quantization library integrated with the Hugging Face ecosystem via the optimum library.

Figure: Visual comparison of Flux-dev model outputs using Quanto INT4 (left), INT8 (center), and FP8 (right) quantization.

quanto Precision	Memory after loading	Peak memory	Inference time
INT4	12.254 GB	16.139 GB	109 seconds
...

Example (Flux-dev with quanto INT8 weight-only):

@@
- from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
+ from diffusers import QuantoConfig as DiffusersQuantoConfig
...

Note: At the time of writing, float8 support with Quanto requires optimum-quanto<0.2.5 and using quanto directly. We will be working on fixing this.

Example (Flux-dev with quanto FP8 weight-only)

import torch
from diffusers import AutoModel, FluxPipeline
from transformers import T5EncoderModel
...

For more information check out the Quanto docs.

GGUF is a file format popular in the llama.cpp community for storing quantized models.

Figure: Visual comparison of Flux-dev model outputs using GGUF Q2_k (left), Q4_1 (center), and Q8_0 (right) quantization.

GGUF Precision	Memory after loading	Peak memory	Inference time
Q2_k	13.264 GB	17.752 GB	26 seconds
...

Example (Flux-dev with GGUF Q4_1)

import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig
model_id = "black-forest-labs/FLUX.1-dev"
...

For more information check out the GGUF docs.
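Loading a pre-quantized GGUF transformer into a Flux pipeline can be sketched as below. The checkpoint location is deliberately left as a parameter (`gguf_path`, a hypothetical name) because the sketch does not assume any particular file; the wrapper function is also illustrative. Running it needs a CUDA GPU and the gguf package.

```python
def load_flux_gguf(gguf_path, model_id="black-forest-labs/FLUX.1-dev"):
    """Build a Flux pipeline around a GGUF-quantized transformer.

    `gguf_path` is a local path or URL to a .gguf transformer checkpoint
    supplied by the caller.
    """
    # Imports are local so that defining this function needs no GPU stack.
    import torch
    from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

    transformer = FluxTransformer2DModel.from_single_file(
        gguf_path,
        # Weights stay in the GGUF format; computation runs in BF16.
        quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
        torch_dtype=torch.bfloat16,
    )
    pipe = FluxPipeline.from_pretrained(
        model_id,
        transformer=transformer,
        torch_dtype=torch.bfloat16,
    )
    return pipe.to("cuda")
```

Only the transformer comes from the GGUF file here; the text encoders and VAE are still loaded from the regular checkpoint.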

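FP8 layerwise casting is a memory optimization technique. It works by storing the model's weights in the compact FP8 (8-bit floating point) format, which uses roughly half the memory of standard FP16 or BF16 precision. Just before a layer performs its computations, its weights are dynamically cast up to a higher compute precision (such as FP16/BF16). Immediately afterward, the weights are cast back down to FP8 for efficient storage. This approach works because the core computations retain high precision, and layers that are particularly sensitive to quantization (such as normalization) are typically skipped. The technique can also be combined with group offloading for further memory savings.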
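The store-low, compute-high idea can be illustrated with the standard library alone. Python has no FP8 type, so this toy sketch substitutes float16 for storage and Python's 64-bit float for compute; it only mirrors the shape of the technique and is not how Diffusers implements it. The class name is hypothetical.

```python
import struct


class LowPrecisionLinear:
    """Toy layer that stores weights as packed float16 and upcasts per call."""

    def __init__(self, weights):
        self._count = len(weights)
        # Storage precision: 2 bytes per weight (float16 stands in for FP8).
        self._packed = struct.pack(f"<{self._count}e", *weights)

    def storage_bytes(self):
        return len(self._packed)

    def forward(self, inputs):
        # Upcast just before computing: unpack to full-precision floats.
        weights = struct.unpack(f"<{self._count}e", self._packed)
        result = sum(w * x for w, x in zip(weights, inputs))
        # The upcast copy is discarded on return; only the packed form persists.
        return result


if __name__ == "__main__":
    layer = LowPrecisionLinear([0.5, -0.25, 1.0])
    print("stored bytes:", layer.storage_bytes())
    print("output:", layer.forward([2.0, 4.0, 1.0]))
```

The multiply-accumulate runs at full precision, so the only error comes from rounding the stored weights, which is the property the paragraph above relies on.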

Figure: Visual output of the Flux-dev model using FP8 layerwise casting (e4m3) quantization.

precision	Memory after loading	Peak memory	Inference time
FP8 (e4m3)	23.682 GB	28.451 GB	13 seconds

import torch
from diffusers import AutoModel, FluxPipeline
model_id = "black-forest-labs/FLUX.1-dev"
...

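For more information check out the Layerwise casting docs.

Most of these quantization backends can be combined with the memory optimization techniques offered in Diffusers. Let's explore CPU offloading, group offloading, and torch.compile. You can learn more about these techniques in the Diffusers documentation.

Note: At the time of writing, bnb + torch.compile also works if bnb is installed from source and you use PyTorch nightly or fullgraph=False.

Example (Flux-dev with BnB 4-bit + enable_model_cpu_offload):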

import torch
from diffusers import FluxPipeline
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
...

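Model CPU offloading (enable_model_cpu_offload): This method moves entire model components (such as the UNet, text encoders, or VAE) between CPU and GPU during the inference pipeline. It offers substantial VRAM savings and is generally faster than more granular offloading because it involves fewer, larger data transfers.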
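A minimal sketch of how model CPU offloading is typically switched on for an already-loaded pipeline follows. The helper name and its parameters are illustrative; only `enable_model_cpu_offload` and the pipeline call itself are Diffusers API.

```python
def generate_with_cpu_offload(pipe, prompt, steps=50):
    """Enable model-level CPU offloading, then run one generation."""
    # Do not call pipe.to("cuda") beforehand: the offload hooks move each
    # component to the GPU when it is needed and back to the CPU afterwards.
    pipe.enable_model_cpu_offload()
    output = pipe(prompt, num_inference_steps=steps)
    return output.images[0]
```

Because whole components are swapped in and out, peak GPU memory is bounded by the largest single component rather than by the full pipeline.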

bnb + enable_model_cpu_offload:

Precision	Memory after loading	Peak memory	Inference time
4-bit	12.383 GB	12.383 GB	17 seconds
8-bit	19.182 GB	23.428 GB	27 seconds
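To put these rows in context, the sketch below compares them with the BF16 baseline from the bitsandbytes section (36.166 GB peak, 12 seconds). The numbers are copied from the tables in this post; the helper names are illustrative. The BF16 baseline was measured without offloading.

```python
# BF16 baseline row from the bitsandbytes benchmark table.
BF16 = {"peak_gb": 36.166, "seconds": 12}

# bnb + enable_model_cpu_offload rows from the table above.
BNB_OFFLOAD = {
    "4-bit": {"peak_gb": 12.383, "seconds": 17},
    "8-bit": {"peak_gb": 23.428, "seconds": 27},
}


def peak_reduction_pct(row, baseline=BF16):
    """Percentage drop in peak memory relative to the baseline."""
    return 100 * (1 - row["peak_gb"] / baseline["peak_gb"])


def slowdown(row, baseline=BF16):
    """How many times longer inference takes than the baseline."""
    return row["seconds"] / baseline["seconds"]


if __name__ == "__main__":
    for name, row in BNB_OFFLOAD.items():
        print(
            f"{name}: peak memory -{peak_reduction_pct(row):.1f}%, "
            f"inference {slowdown(row):.2f}x slower"
        )
```

On these figures, 4-bit with offloading cuts peak memory by roughly two thirds for a moderate slowdown, while 8-bit saves about a third and takes more than twice as long.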

Example (Flux-dev with FP8 layerwise casting + group offloading):

import torch
from diffusers import FluxPipeline, AutoModel
model_id = "black-forest-labs/FLUX.1-dev"
...

Group offloading (enable_group_offload for diffusers components or apply_group_offloading for generic torch.nn.Modules): This method moves groups of internal model layers (such as torch.nn.ModuleList or torch.nn.Sequential instances) to the CPU. It is typically more memory-efficient than full model offloading and faster than sequential offloading.

FP8 layerwise casting + group offloading:

precision	Memory after loading	Peak memory	Inference time
FP8 (e4m3)	9.264 GB	14.232 GB	58 seconds
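The combination measured above can be sketched as follows. This is an illustration under stated assumptions, not the article's original listing: applying group offloading to every component, the "leaf_level" offload type, and the wrapper function name are choices made here. It shows both entry points named earlier, the `enable_group_offload` method for diffusers models and `apply_group_offloading` for generic modules.

```python
def load_flux_fp8_group_offload(model_id="black-forest-labs/FLUX.1-dev"):
    """Load Flux-dev with FP8 weight storage and group offloading."""
    # Imports are local so that defining this function needs no GPU stack.
    import torch
    from diffusers import AutoModel, FluxPipeline
    from diffusers.hooks import apply_group_offloading

    onload = torch.device("cuda")
    offload = torch.device("cpu")

    transformer = AutoModel.from_pretrained(
        model_id, subfolder="transformer", torch_dtype=torch.bfloat16
    )
    # Store weights in FP8 (e4m3); upcast to BF16 only while a layer runs.
    transformer.enable_layerwise_casting(
        storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16
    )
    # Keep layer groups on the CPU and stream them to the GPU on demand.
    transformer.enable_group_offload(
        onload_device=onload,
        offload_device=offload,
        offload_type="leaf_level",
        use_stream=True,
    )

    pipe = FluxPipeline.from_pretrained(
        model_id, transformer=transformer, torch_dtype=torch.bfloat16
    )

    # Text encoders are transformers models, so they use the generic helper.
    for name in ("text_encoder", "text_encoder_2"):
        apply_group_offloading(
            getattr(pipe, name),
            onload_device=onload,
            offload_device=offload,
            offload_type="leaf_level",
            use_stream=True,
        )
    # The VAE is a diffusers model and has the method directly.
    pipe.vae.enable_group_offload(
        onload_device=onload, offload_device=offload, offload_type="leaf_level"
    )
    return pipe
```

Leaf-level offloading keeps the least on the GPU at any time; coarser grouping trades some of that memory saving for fewer transfers.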

Example (Flux-dev with torchao 4-bit + torch.compile):

import torch
from diffusers import FluxPipeline
from diffusers import TorchAoConfig as DiffusersTorchAoConfig
...

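Note: torch.compile can introduce subtle numerical differences, leading to changes in the image output.

torch.compile: Another complementary approach is to accelerate model execution with PyTorch 2.x's torch.compile() feature. Compiling the model doesn't directly lower memory, but it can significantly speed up inference. PyTorch 2.0's compile (Torch Dynamo) works by tracing and optimizing the model graph ahead of time.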
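Compiling the transformer of a loaded pipeline can be sketched as below, together with a small timing helper for comparing the first (compiling) call against later ones. The `mode` and `fullgraph` defaults are assumptions for this sketch, and both function names are illustrative.

```python
import time


def compile_transformer(pipe, mode="max-autotune", fullgraph=True):
    """Replace the pipeline's transformer with a torch.compile-wrapped version."""
    # Imported locally so that defining this function does not require torch.
    import torch

    # Compilation is lazy: the actual work happens on the first forward pass.
    pipe.transformer = torch.compile(pipe.transformer, mode=mode, fullgraph=fullgraph)
    return pipe


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

Timing the first call separately from subsequent calls, for example with `time_call(pipe, prompt)`, separates the one-off compile cost from steady-state inference time.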

torchao + torch.compile:

torchao Precision	Memory after loading	Peak memory	Inference time	Compile Time
int4_weight_only	10.635 GB	15.238 GB	6 seconds	~285 seconds
...
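Whether the compile time is worth paying depends on how many images you generate. Using the int4_weight_only figures from this post (109 seconds per image without compilation, 6 seconds with it, and about 285 seconds to compile), the break-even point can be worked out with a short sketch. The function name is illustrative.

```python
import math


def break_even_images(compile_s, eager_s_per_image, compiled_s_per_image):
    """Smallest image count at which compiling is faster overall.

    Solves compile_s + n * compiled_s_per_image < n * eager_s_per_image for n.
    """
    saved_per_image = eager_s_per_image - compiled_s_per_image
    if saved_per_image <= 0:
        return None  # Compilation never pays off.
    return math.floor(compile_s / saved_per_image) + 1


if __name__ == "__main__":
    # int4_weight_only: 109 s eager, 6 s compiled, ~285 s compile time.
    print(break_even_images(285, 109, 6))  # prints 3
```

For this configuration the compile cost is recovered from the third image onward; a single one-off generation would be faster without compiling.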

Explore the benchmarking results here:

You can find bitsandbytes and torchao quantized models in our Hugging Face collection: link to collection.

Here's a quick guide to choosing a quantization backend:

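Easiest memory savings (NVIDIA): Start with bitsandbytes 4/8-bit. It can also be combined with torch.compile() for faster inference.
Prioritize inference speed: torchao, GGUF, and bitsandbytes can all be used with torch.compile() to potentially boost inference speed.
For hardware flexibility (CPU/MPS) or FP8 precision: Quanto can be a good option.
Simplicity (Hopper/Ada): Explore FP8 layerwise casting (enable_layerwise_casting).
For using existing GGUF models: Use GGUF loading (from_single_file).
Curious about training with quantization? Stay tuned for a follow-up blog post on that topic! Update (June 19, 2025): It's now available.

Quantization significantly lowers the barrier to entry for using large diffusion models. Experiment with these backends to find the best balance of memory, speed, and quality for your needs.

Acknowledgements: Thanks to Chunte for providing the thumbnail for this post.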
