Remote VAEs for decoding with Inference Endpoints 🤗
Summary

This article presents a way to address the high memory consumption of the VAE decoder that arises when using latent-space diffusion models for high-resolution image and video synthesis. As a solution, it introduces an experimental feature that offloads the decoding step to a remote endpoint. This lets users run models reliably while avoiding the memory constraints and latency overhead of their local GPU. The feature has been added to the `diffusers` library, and the `remote_decode` helper function can be applied to the decoding step of various diffusion models (Stable Diffusion, Flux, and others). The post also highlights that one advantage of remote VAEs is the ability to queue multiple generation requests.
Key points

- Addresses the heavy memory burden the VAE decoder imposes in high-resolution image/video synthesis.
- Introduces a new approach that delegates the decoding step to a remote endpoint, bypassing local GPU constraints.
- The new `remote_decode` helper function can be applied to various diffusion model pipelines.
- One benefit of using a remote VAE is that multiple generation requests can be queued and processed efficiently.
(This post was authored by hlky and Sayak)
When working with latent-space diffusion models for high-resolution image and video synthesis, the VAE decoder can consume a significant amount of memory. This makes it hard for users to run these models on consumer GPUs without sacrificing latency or accepting other trade-offs.
For example, offloading introduces device-transfer overhead, which increases overall inference latency. Tiling is another option that lets us operate on so-called "tiles" of the inputs, but it can negatively affect the quality of the final image.
Therefore, we want to pilot an idea with the community: delegating the decoding process to a remote endpoint.
No data is stored or tracked, and the code is open source. We made some changes to `huggingface-inference-toolkit` and used custom handlers.
This experimental feature is developed by Diffusers 🧨
Table of contents:

- Getting started
  - Code
  - Basic example
  - Options
  - Generation
  - Queueing
- Available VAEs
- Advantages of using a remote VAE
- Provide feedback
Below, we cover three use cases where we think remote VAE inference would be beneficial.
First, we have created a helper method for interacting with Remote VAEs.
Install `diffusers` from `main` to run the code:

```bash
pip install git+https://github.com/huggingface/diffusers@main
```
```python
import torch
from diffusers.utils.remote_utils import remote_decode
```
Here, we show how to use the remote VAE on random tensors.
```python
image = remote_decode(
    endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=torch.randn([1, 4, 64, 64], dtype=torch.float16),
    # ...
)
```
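For reference, a complete call could look like the following minimal sketch; the `scaling_factor` value (0.18215, the standard SD v1.5 VAE factor) and the save step are our assumptions, and `remote_decode` returns a PIL image with the default output type.

```python
import torch
from diffusers.utils.remote_utils import remote_decode

# Decode a random SD-style latent. scaling_factor=0.18215 is the
# standard SD v1.5 VAE value (assumed to apply to this endpoint).
image = remote_decode(
    endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=torch.randn([1, 4, 64, 64], dtype=torch.float16),
    scaling_factor=0.18215,
)
image.save("test.jpg")  # a PIL image is returned by default
```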
Usage for Flux is slightly different: Flux latents are packed, so we need to send the `height` and `width`.
```python
image = remote_decode(
    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=torch.randn([1, 4096, 64], dtype=torch.float16),
    # ...
)
```
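A complete Flux call could look like this sketch; the `height`/`width` (matching the packed latent shape above) and the Flux VAE constants `scaling_factor=0.3611` / `shift_factor=0.1159` are our assumptions for this endpoint.

```python
import torch
from diffusers.utils.remote_utils import remote_decode

# Flux latents are packed: [1, 4096, 64] corresponds to a 1024x1024 image
# (4096 = 64 x 64 patches), so the target height/width must be sent along.
image = remote_decode(
    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=torch.randn([1, 4096, 64], dtype=torch.float16),
    height=1024,
    width=1024,
    scaling_factor=0.3611,  # Flux VAE config values (assumed here)
    shift_factor=0.1159,
)
image.save("flux.jpg")
```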
Finally, an example for HunyuanVideo.
```python
video = remote_decode(
    endpoint="https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=torch.randn([1, 16, 3, 40, 64], dtype=torch.float16),
    # ...
)
```
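A complete HunyuanVideo call could look like the sketch below, assuming the endpoint supports `output_type="mp4"`, in which case the helper returns raw MP4 bytes that we write to disk.

```python
import torch
from diffusers.utils.remote_utils import remote_decode

# Decode a random video latent ([batch, channels, frames, height, width])
# and request an encoded MP4 instead of raw frames (assumed option).
video = remote_decode(
    endpoint="https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=torch.randn([1, 16, 3, 40, 64], dtype=torch.float16),
    output_type="mp4",
)
with open("video.mp4", "wb") as f:
    f.write(video)  # bytes when output_type="mp4"
```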
But we want to use the VAE in an actual pipeline to get a real image, not random noise. The example below shows how to do it with SD v1.5.
```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    # ...
)
```
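One way the full example could look, as a sketch: we assume the pipeline accepts `vae=None` (so no VAE weights are loaded locally), and the prompt is arbitrary.

```python
import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils.remote_utils import remote_decode

# Skip loading the VAE locally; decoding happens on the remote endpoint.
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    vae=None,
).to("cuda")

# output_type="latent" returns the latents instead of decoding them locally.
latent = pipe(
    prompt="Strawberry ice cream in a stylish modern glass",  # arbitrary prompt
    output_type="latent",
).images
image = remote_decode(
    endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=latent,
    scaling_factor=0.18215,  # standard SD v1.5 VAE scaling factor (assumed)
)
image.save("sd15.jpg")
```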
Here's another example with Flux.
```python
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    # ...
)
```
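A corresponding sketch for Flux, under the same assumptions; FLUX.1-schnell is guidance-distilled, so `guidance_scale=0.0` and a handful of steps suffice.

```python
import torch
from diffusers import FluxPipeline
from diffusers.utils.remote_utils import remote_decode

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
    vae=None,  # assumed: skip the local VAE and decode remotely
).to("cuda")

latent = pipe(
    prompt="An astronaut riding a horse on Mars",  # arbitrary prompt
    guidance_scale=0.0,     # schnell is guidance-distilled
    num_inference_steps=4,
    height=1024,
    width=1024,
    output_type="latent",
).images
image = remote_decode(
    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=latent,
    height=1024,  # packed latents: the endpoint needs the target size
    width=1024,
    scaling_factor=0.3611,  # Flux VAE config values (assumed)
    shift_factor=0.1159,
)
image.save("flux.jpg")
```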
Here's an example with HunyuanVideo.
```python
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel

model_id = "hunyuanvideo-community/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    # ...
)
```
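A sketch completing the HunyuanVideo setup; loading the transformer in bf16 and the rest of the pipeline in fp16, plus the frame and resolution settings, are illustrative choices rather than the post's exact configuration.

```python
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils.remote_utils import remote_decode

model_id = "hunyuanvideo-community/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    vae=None,  # assumed: decode on the remote endpoint instead
    torch_dtype=torch.float16,
).to("cuda")

latent = pipe(
    prompt="A cat walks on the grass, realistic",  # arbitrary prompt
    height=320,
    width=512,
    num_frames=61,
    num_inference_steps=30,
    output_type="latent",
).frames
video = remote_decode(
    endpoint="https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=latent,
    output_type="mp4",  # assumed: returns encoded MP4 bytes
)
with open("video.mp4", "wb") as f:
    f.write(video)
```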
One of the great benefits of using a remote VAE is that we can queue multiple generation requests. While the current latent is being decoded, we can already queue the next one, which improves concurrency.
```python
import queue
import threading
from IPython.display import display
# ...
```
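A minimal sketch of the pattern: a background thread drains a queue and calls `remote_decode` while the main thread keeps generating latents. The model, prompts, and sentinel logic here are our own choices.

```python
import queue
import threading

import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils.remote_utils import remote_decode
from IPython.display import display

q: queue.Queue = queue.Queue()

def decode_worker():
    # Pull latents off the queue and decode them remotely, so decoding
    # overlaps with generation of the next latent on the GPU.
    while True:
        latent = q.get()
        if latent is None:  # sentinel: stop the worker
            break
        image = remote_decode(
            endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
            tensor=latent,
            scaling_factor=0.18215,  # assumed SD v1.5 VAE factor
        )
        display(image)
        q.task_done()

worker = threading.Thread(target=decode_worker, daemon=True)
worker.start()

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    vae=None,  # assumed: no local VAE, decoding is remote
).to("cuda")

prompts = ["a red panda", "a lighthouse at dusk", "a bowl of ramen"]
for prompt in prompts:
    latent = pipe(prompt=prompt, output_type="latent").images
    q.put(latent)  # queue for remote decoding without blocking generation

q.put(None)   # signal the worker to exit
worker.join()
```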
The following tables show the VRAM requirements of VAE decoding on different GPUs. The memory-usage percentage determines whether users of a given GPU need to offload; offload times vary with the CPU, RAM, and HDD/NVMe. Tiled decoding increases inference time.
SD v1.5
| GPU | Resolution | Time (seconds) | Memory (%) | Tiled Time (seconds) | Tiled Memory (%) |
|---|---|---|---|---|---|
| NVIDIA GeForce RTX 4090 | 512x512 | 0.031 | 5.60% | 0.031 (0%) | 5.60% |
| ... |
SDXL
| GPU | Resolution | Time (seconds) | Memory (%) | Tiled Time (seconds) | Tiled Memory (%) |
|---|---|---|---|---|---|
| NVIDIA GeForce RTX 4090 | 512x512 | 0.057 | 10.00% | 0.057 (0%) | 10.00% |
| ... |
If you like this idea and feature, please give us feedback on how we can improve it and whether you would like to see it integrated more broadly into the Hugging Face ecosystem, and let us know if you want features like this.
If this pilot goes well, we plan to create optimized VAE endpoints for more models, including ones that can generate high-resolution video!
- Open an issue with Diffusers at this link.
- Answer the questions and provide any extra information you want.
- Submit!