
Remote VAEs for decoding with Inference Endpoints πŸ€—

Summary

This technical article presents an approach to the high memory consumption of the VAE decoder that arises when using latent-space diffusion models for high-resolution image and video synthesis. As a solution, it introduces an experimental feature that delegates the decoding process to a remote endpoint, letting users run these models reliably without the memory constraints and latency overhead they would otherwise face on a local GPU. The feature has been added to the `diffusers` library, and the `remote_decode` helper function can be applied to the decoding step of various diffusion models (Stable Diffusion, Flux, and others). The article also highlights that one advantage of using a remote VAE is the ability to queue multiple generation requests.

Key points

  • Addresses the heavy memory burden the VAE decoder imposes during high-resolution image/video synthesis.
  • Introduces a new approach that delegates the decoding process to a remote endpoint, sidestepping local GPU constraints.
  • The new `remote_decode` helper function can be applied to a variety of diffusion model pipelines.
  • One benefit of using a remote VAE is that multiple generation requests can be queued efficiently.

(This post was authored by hlky and Sayak)

When working with latent-space diffusion models for high-resolution image and video synthesis, the VAE decoder can consume a disproportionately large amount of memory. This makes it hard for users to run these models on consumer GPUs without resorting to workarounds that sacrifice latency or quality.

For example, offloading introduces a device-transfer overhead that adds to the overall inference latency. Tiling is another solution that lets us operate on so-called "tiles" of the input, but it can have a negative impact on the quality of the final image.

Therefore, we want to pilot an idea with the community β€” delegating the decoding process to a remote endpoint.

No data is stored or tracked, and the code is open source. We made some changes to huggingface-inference-toolkit and use custom handlers.

This experimental feature is developed by Diffusers 🧨

Table of contents:

  • Getting started

  • Code

  • Basic example

  • Options

  • Generation

  • Queueing

  • Available VAEs

  • Advantages of using a remote VAE

  • Provide feedback

Below, we cover three use cases where we think this remote VAE inference would be beneficial.

First, we have created a helper method for interacting with Remote VAEs.

Install `diffusers` from `main` to run the code:

pip install git+https://github.com/huggingface/diffusers@main

Code

from diffusers.utils.remote_utils import remote_decode

Here, we show how to use the remote VAE on random tensors.

Code

image = remote_decode(
    endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=torch.randn([1, 4, 64, 64], dtype=torch.float16),
    ...
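For reference, a complete call might look like the following sketch. The `scaling_factor` value is the standard SD v1.5 VAE factor, and the assumption that `remote_decode` returns a PIL image by default is ours; neither is copied from the truncated snippet above.

import torch

from diffusers.utils.remote_utils import remote_decode

# Decode a random SD-style latent ([batch, 4, height/8, width/8]) on the remote endpoint.
image = remote_decode(
    endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=torch.randn([1, 4, 64, 64], dtype=torch.float16),
    scaling_factor=0.18215,  # assumed: the usual SD v1.5 VAE scaling factor
)
image.save("decoded.png")  # assumed: the default return type is a PIL image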

Usage for Flux is slightly different. Flux latents are packed, so we need to send the height and width.

Code

image = remote_decode(
    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=torch.randn([1, 4096, 64], dtype=torch.float16),
    ...
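A complete call might look like the sketch below. The 4096 packed tokens correspond to a 1024x1024 image here (a 128x128 latent grid packed into 2x2 patches), which is why `height` and `width` are passed explicitly; the `scaling_factor` and `shift_factor` values are the published Flux VAE config values and are our assumptions, not part of the truncated snippet.

import torch

from diffusers.utils.remote_utils import remote_decode

# Flux latents are packed into [batch, tokens, channels]; 4096 tokens map to a
# 1024x1024 output here, so the target height/width must accompany the tensor.
image = remote_decode(
    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=torch.randn([1, 4096, 64], dtype=torch.float16),
    height=1024,
    width=1024,
    scaling_factor=0.3611,  # assumed: Flux VAE scaling factor
    shift_factor=0.1159,    # assumed: Flux VAE shift factor
)
image.save("decoded_flux.png")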

Finally, an example for HunyuanVideo.

Code

video = remote_decode(
    endpoint="https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=torch.randn([1, 16, 3, 40, 64], dtype=torch.float16),
    ...
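A possible complete call is sketched below. The `output_type="mp4"` argument and the bytes handling are assumptions about how the video endpoint returns its result, not details taken from the truncated snippet.

import torch

from diffusers.utils.remote_utils import remote_decode

# Decode a random HunyuanVideo-style latent ([batch, channels, frames, height/8, width/8]).
video = remote_decode(
    endpoint="https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=torch.randn([1, 16, 3, 40, 64], dtype=torch.float16),
    output_type="mp4",  # assumed: ask the endpoint for an encoded mp4
)
# Assumed: the endpoint returns raw mp4 bytes in this mode.
with open("video.mp4", "wb") as f:
    f.write(video)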

But we want to use the VAE on an actual pipeline to get an actual image, not random noise. The example below shows how to do it with SD v1.5.

Code

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    ...
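The idea is to load the pipeline without its VAE, ask it for latents instead of images, and hand those latents to `remote_decode`. A minimal end-to-end sketch follows; loading with `vae=None`, the prompt, and the `scaling_factor` value are our assumptions rather than lines from the truncated snippet above.

import torch

from diffusers import StableDiffusionPipeline
from diffusers.utils.remote_utils import remote_decode

# Load the pipeline without a VAE so no decoder weights occupy GPU memory (assumed setup).
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    vae=None,
).to("cuda")

# Stop at the latent instead of decoding locally.
latent = pipe("an astronaut riding a horse on the moon", output_type="latent").images

# Decode on the remote endpoint and save the result.
image = remote_decode(
    endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=latent,
    scaling_factor=0.18215,  # assumed: SD v1.5 VAE scaling factor
)
image.save("astronaut.png")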

Here's another example with Flux.

Code

from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    ...
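A sketch for Flux follows the same pattern, with the packed-latent `height` and `width` forwarded to `remote_decode`. The sampler settings, the `vae=None` loading, and the VAE scaling/shift factors below are illustrative assumptions.

import torch

from diffusers import FluxPipeline
from diffusers.utils.remote_utils import remote_decode

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
    vae=None,  # assumed: skip loading the local VAE
).to("cuda")

# FLUX.1-schnell is a distilled model, so a few steps without guidance suffice (assumed settings).
latent = pipe(
    "a tiny robot watering a bonsai tree",
    guidance_scale=0.0,
    num_inference_steps=4,
    height=1024,
    width=1024,
    output_type="latent",
).images

image = remote_decode(
    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=latent,
    height=1024,  # Flux latents are packed, so the target size is sent along
    width=1024,
    scaling_factor=0.3611,  # assumed Flux VAE config values
    shift_factor=0.1159,
)
image.save("flux.png")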

Here's an example with HunyuanVideo.

Code

from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel

model_id = "hunyuanvideo-community/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    ...
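A sketch of the full flow might look like the following; the dtypes, generation settings, and mp4 handling are assumptions chosen for illustration.

import torch

from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils.remote_utils import remote_decode

model_id = "hunyuanvideo-community/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16  # assumed dtype
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id, transformer=transformer, vae=None, torch_dtype=torch.float16
).to("cuda")

# Generate latents only; the prompt and sizes below are illustrative.
latent = pipe(
    prompt="a cat walking on grass, realistic",
    height=320,
    width=512,
    num_frames=61,
    num_inference_steps=30,
    output_type="latent",
).frames

video = remote_decode(
    endpoint="https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=latent,
    output_type="mp4",  # assumed: request an encoded mp4 from the endpoint
)
with open("video.mp4", "wb") as f:
    f.write(video)  # assumed: raw mp4 bytes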

One of the great benefits of using a remote VAE is that we can queue multiple generation requests. While the current latent is being processed for decoding, we can already queue another one. This helps improve concurrency.

Code

import queue
import threading
from IPython.display import display
...
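A minimal sketch of the queueing pattern under these assumptions: a background thread drains a queue of latents and sends them to the SD v1.5 endpoint above, while the main thread keeps the GPU busy generating the next latent. The original snippet displays images in a notebook via IPython; this sketch saves them to disk instead, and the prompts and `scaling_factor` are illustrative.

import queue
import threading

import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils.remote_utils import remote_decode

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    vae=None,  # assumed: the VAE stays on the remote endpoint
).to("cuda")

latent_queue = queue.Queue()

def decode_worker():
    # Pull latents off the queue and decode them remotely, one at a time.
    while True:
        item = latent_queue.get()
        if item is None:  # sentinel: no more work
            break
        index, latent = item
        image = remote_decode(
            endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
            tensor=latent,
            scaling_factor=0.18215,  # assumed: SD v1.5 VAE scaling factor
        )
        image.save(f"output_{index}.png")
        latent_queue.task_done()

worker = threading.Thread(target=decode_worker, daemon=True)
worker.start()

prompts = ["a photo of a cat", "a photo of a dog", "a photo of a fox"]
for i, prompt in enumerate(prompts):
    # Generation keeps the GPU busy while earlier latents are decoded remotely.
    latent = pipe(prompt, output_type="latent").images
    latent_queue.put((i, latent))

latent_queue.join()     # wait for all queued latents to be decoded
latent_queue.put(None)  # stop the worker
worker.join()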

These tables demonstrate the VRAM requirements with different GPUs. Memory usage % determines whether users of a certain GPU will need to offload. Offload times vary with CPU, RAM and HDD/NVMe. Tiled decoding increases inference time.

SD v1.5

| GPU | Resolution | Time (seconds) | Memory (%) | Tiled Time (seconds) | Tiled Memory (%) |
| --- | --- | --- | --- | --- | --- |
| NVIDIA GeForce RTX 4090 | 512x512 | 0.031 | 5.60% | 0.031 (0%) | 5.60% |
| ... | ... | ... | ... | ... | ... |

SDXL

| GPU | Resolution | Time (seconds) | Memory Consumed (%) | Tiled Time (seconds) | Tiled Memory (%) |
| --- | --- | --- | --- | --- | --- |
| NVIDIA GeForce RTX 4090 | 512x512 | 0.057 | 10.00% | 0.057 (0%) | 10.00% |
| ... | ... | ... | ... | ... | ... |

If you like this idea and feature, please give us feedback on how it could be improved to help integrate it more natively into the Hugging Face ecosystem, and let us know whether you would like more features like this.

If this pilot goes well, we plan to create optimized VAE endpoints for more models, including ones capable of generating high-resolution video!

  • Open an issue with Diffusers at this link.
  • Answer the questions and provide any additional information you'd like.
  • Submit!

AI μžλ™ 생성 μ½˜ν…μΈ 

λ³Έ μ½˜ν…μΈ λŠ” Hugging Face Blog의 원문을 AIκ°€ μžλ™μœΌλ‘œ μš”μ•½Β·λ²ˆμ—­Β·λΆ„μ„ν•œ κ²ƒμž…λ‹ˆλ‹€. 원 μ €μž‘κΆŒμ€ μ›μ €μž‘μžμ—κ²Œ 있으며, μ •ν™•ν•œ λ‚΄μš©μ€ λ°˜λ“œμ‹œ 원문을 확인해 μ£Όμ„Έμš”.
