Remote VAEs for decoding with Inference Endpoints 🤗
Summary

This article presents a way to address the high memory consumption of the VAE decoder that arises when using latent-space diffusion models for high-resolution image and video synthesis. As a solution, it introduces an experimental feature that offloads the decoding step to a remote endpoint. This lets users run models reliably while avoiding the memory constraints and latency overhead of their local GPU. The feature has been added to the `diffusers` library, and the `remote_decode` helper function can be applied to the decoding step of various diffusion models (Stable Diffusion, Flux, and others). The post also highlights that one advantage of remote VAEs is the ability to queue multiple generation requests.
Key points

- Addresses the heavy memory burden the VAE decoder imposes in high-resolution image/video synthesis.
- Introduces a new approach that delegates the decoding step to a remote endpoint, bypassing local GPU constraints.
- The new `remote_decode` helper function can be applied to various diffusion model pipelines.
- One benefit of using a remote VAE is that multiple generation requests can be queued and processed efficiently.
(This post was authored by hlky and Sayak)
When working with latent-space diffusion models for high-resolution image and video synthesis, the VAE decoder can consume a significant amount of memory. This makes it hard for users to run these models on consumer GPUs without sacrificing latency or accepting other trade-offs.
For example, offloading introduces device-transfer overhead, which increases overall inference latency. Tiling is another option that lets us operate on so-called "tiles" of the inputs, but it can negatively affect the quality of the final image.
Therefore, we want to pilot an idea with the community: delegating the decoding process to a remote endpoint.
No data is stored or tracked, and the code is open source. We made some changes to `huggingface-inference-toolkit` and used custom handlers.
This experimental feature is developed by Diffusers 🧨
Table of contents:

- Getting started
  - Code
  - Basic example
  - Options
  - Generation
  - Queueing
- Available VAEs
- Advantages of using a remote VAE
- Provide feedback
Below, we cover three use cases where we think remote VAE inference would be beneficial.
First, we have created a helper method for interacting with Remote VAEs.
Install `diffusers` from `main` to run the code:

```bash
pip install git+https://github.com/huggingface/diffusers@main
```
```python
import torch
from diffusers.utils.remote_utils import remote_decode
```
Here, we show how to use the remote VAE on random tensors.
```python
image = remote_decode(
    endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=torch.randn([1, 4, 64, 64], dtype=torch.float16),
    # ...
)
```
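For reference, a complete call could look like the following minimal sketch; the `scaling_factor` value (0.18215, the standard SD v1.5 VAE factor) and the save step are our assumptions, and `remote_decode` returns a PIL image with the default output type.

```python
import torch
from diffusers.utils.remote_utils import remote_decode

# Decode a random SD-style latent. scaling_factor=0.18215 is the
# standard SD v1.5 VAE value (assumed to apply to this endpoint).
image = remote_decode(
    endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=torch.randn([1, 4, 64, 64], dtype=torch.float16),
    scaling_factor=0.18215,
)
image.save("test.jpg")  # a PIL image is returned by default
```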
Usage for Flux is slightly different: Flux latents are packed, so we need to send the `height` and `width`.
```python
image = remote_decode(
    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=torch.randn([1, 4096, 64], dtype=torch.float16),
    # ...
)
```
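A complete Flux call could look like this sketch; the `height`/`width` (matching the packed latent shape above) and the Flux VAE constants `scaling_factor=0.3611` / `shift_factor=0.1159` are our assumptions for this endpoint.

```python
import torch
from diffusers.utils.remote_utils import remote_decode

# Flux latents are packed: [1, 4096, 64] corresponds to a 1024x1024 image
# (4096 = 64 x 64 patches), so the target height/width must be sent along.
image = remote_decode(
    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=torch.randn([1, 4096, 64], dtype=torch.float16),
    height=1024,
    width=1024,
    scaling_factor=0.3611,  # Flux VAE config values (assumed here)
    shift_factor=0.1159,
)
image.save("flux.jpg")
```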
Finally, an example for HunyuanVideo.
```python
video = remote_decode(
    endpoint="https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=torch.randn([1, 16, 3, 40, 64], dtype=torch.float16),
    # ...
)
```
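A complete HunyuanVideo call could look like the sketch below, assuming the endpoint supports `output_type="mp4"`, in which case the helper returns raw MP4 bytes that we write to disk.

```python
import torch
from diffusers.utils.remote_utils import remote_decode

# Decode a random video latent ([batch, channels, frames, height, width])
# and request an encoded MP4 instead of raw frames (assumed option).
video = remote_decode(
    endpoint="https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=torch.randn([1, 16, 3, 40, 64], dtype=torch.float16),
    output_type="mp4",
)
with open("video.mp4", "wb") as f:
    f.write(video)  # bytes when output_type="mp4"
```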
But we want to use the VAE in an actual pipeline to get a real image, not random noise. The example below shows how to do it with SD v1.5.
```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    # ...
)
```
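One way the full example could look, as a sketch: we assume the pipeline accepts `vae=None` (so no VAE weights are loaded locally), and the prompt is arbitrary.

```python
import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils.remote_utils import remote_decode

# Skip loading the VAE locally; decoding happens on the remote endpoint.
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    vae=None,
).to("cuda")

# output_type="latent" returns the latents instead of decoding them locally.
latent = pipe(
    prompt="Strawberry ice cream in a stylish modern glass",  # arbitrary prompt
    output_type="latent",
).images
image = remote_decode(
    endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=latent,
    scaling_factor=0.18215,  # standard SD v1.5 VAE scaling factor (assumed)
)
image.save("sd15.jpg")
```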
Here's another example with Flux.
```python
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    # ...
)
```
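A corresponding sketch for Flux, under the same assumptions; FLUX.1-schnell is guidance-distilled, so `guidance_scale=0.0` and a handful of steps suffice.

```python
import torch
from diffusers import FluxPipeline
from diffusers.utils.remote_utils import remote_decode

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
    vae=None,  # assumed: skip the local VAE and decode remotely
).to("cuda")

latent = pipe(
    prompt="An astronaut riding a horse on Mars",  # arbitrary prompt
    guidance_scale=0.0,     # schnell is guidance-distilled
    num_inference_steps=4,
    height=1024,
    width=1024,
    output_type="latent",
).images
image = remote_decode(
    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=latent,
    height=1024,  # packed latents: the endpoint needs the target size
    width=1024,
    scaling_factor=0.3611,  # Flux VAE config values (assumed)
    shift_factor=0.1159,
)
image.save("flux.jpg")
```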
Here's an example with HunyuanVideo.
```python
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel

model_id = "hunyuanvideo-community/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    # ...
)
```
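A sketch completing the HunyuanVideo setup; loading the transformer in bf16 and the rest of the pipeline in fp16, plus the frame and resolution settings, are illustrative choices rather than the post's exact configuration.

```python
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils.remote_utils import remote_decode

model_id = "hunyuanvideo-community/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    vae=None,  # assumed: decode on the remote endpoint instead
    torch_dtype=torch.float16,
).to("cuda")

latent = pipe(
    prompt="A cat walks on the grass, realistic",  # arbitrary prompt
    height=320,
    width=512,
    num_frames=61,
    num_inference_steps=30,
    output_type="latent",
).frames
video = remote_decode(
    endpoint="https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=latent,
    output_type="mp4",  # assumed: returns encoded MP4 bytes
)
with open("video.mp4", "wb") as f:
    f.write(video)
```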
One of the great benefits of using a remote VAE is that we can queue multiple generation requests. While the current latent is being decoded, we can already queue the next one, which improves concurrency.
```python
import queue
import threading
from IPython.display import display
# ...
```
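A minimal sketch of the pattern: a background thread drains a queue and calls `remote_decode` while the main thread keeps generating latents. The model, prompts, and sentinel logic here are our own choices.

```python
import queue
import threading

import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils.remote_utils import remote_decode
from IPython.display import display

q: queue.Queue = queue.Queue()

def decode_worker():
    # Pull latents off the queue and decode them remotely, so decoding
    # overlaps with generation of the next latent on the GPU.
    while True:
        latent = q.get()
        if latent is None:  # sentinel: stop the worker
            break
        image = remote_decode(
            endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
            tensor=latent,
            scaling_factor=0.18215,  # assumed SD v1.5 VAE factor
        )
        display(image)
        q.task_done()

worker = threading.Thread(target=decode_worker, daemon=True)
worker.start()

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    vae=None,  # assumed: no local VAE, decoding is remote
).to("cuda")

prompts = ["a red panda", "a lighthouse at dusk", "a bowl of ramen"]
for prompt in prompts:
    latent = pipe(prompt=prompt, output_type="latent").images
    q.put(latent)  # queue for remote decoding without blocking generation

q.put(None)   # signal the worker to exit
worker.join()
```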
The following tables show the VRAM requirements of VAE decoding on different GPUs. The memory-usage percentage determines whether users of a given GPU need to offload; offload times vary with the CPU, RAM, and HDD/NVMe. Tiled decoding increases inference time.
SD v1.5
| GPU | Resolution | Time (seconds) | Memory (%) | Tiled Time (seconds) | Tiled Memory (%) |
|---|---|---|---|---|---|
| NVIDIA GeForce RTX 4090 | 512x512 | 0.031 | 5.60% | 0.031 (0%) | 5.60% |
| ... |
SDXL
| GPU | Resolution | Time (seconds) | Memory (%) | Tiled Time (seconds) | Tiled Memory (%) |
|---|---|---|---|---|---|
| NVIDIA GeForce RTX 4090 | 512x512 | 0.057 | 10.00% | 0.057 (0%) | 10.00% |
| ... |
If you like this idea and feature, please give us feedback on how we can improve it and whether you would like to see it integrated more broadly into the Hugging Face ecosystem, and let us know if you want features like this.
If this pilot goes well, we plan to create optimized VAE endpoints for more models, including ones that can generate high-resolution video!
- Open an issue with Diffusers at this link.
- Answer the questions and provide any extra information you want.
- Submit!