Speeding up ZeroGPU Spaces with ahead-of-time compilation

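ZeroGPU lets you use Nvidia H200 hardware in Hugging Face Spaces without locking up a GPU while traffic is idle. It is efficient, flexible, and ideal for demos, but it does not always make full use of everything the GPU and CUDA stack can offer. Generating images or videos can take a significant amount of time, so squeezing more performance out of the H200 hardware matters.

This is where PyTorch ahead-of-time (AoT) compilation comes in. Instead of compiling models on the fly (which does not play well with ZeroGPU's short-lived processes), AoT lets you optimize once and reload instantly.

The result: faster, smoother demos and experiences, with speedups of **1.3×–1.8×** on models like Flux, Wan, and LTX 🚀

In this post, we'll show how to set up ahead-of-time (AoT) compilation in ZeroGPU Spaces. We'll explore advanced techniques such as FP8 quantization and dynamic shapes, and share working demos you can try right away. If you can't wait, check out the ZeroGPU-powered demos in the zerogpu-aoti organization.

Only Pro users and Team/Enterprise organization members can create ZeroGPU Spaces, but anyone can use them freely (Pro, Team, and Enterprise users get 8× more ZeroGPU quota).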

What is ZeroGPU
PyTorch compilation
Ahead-of-time compilation on ZeroGPU
Gotchas
AoT compiled ZeroGPU Spaces demos
Conclusion
Resources

Spaces is a platform provided by Hugging Face that lets ML practitioners easily publish demo apps.

A typical demo app looks like this:

import gradio as gr
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(...).to('cuda')
...

This works great, but it reserves a GPU for the entire lifetime of the Space, even when there is no user activity.

When .to('cuda') is executed in this line:

pipe = DiffusionPipeline.from_pretrained(...).to('cuda')

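PyTorch initializes the NVIDIA driver, which sets the process up on CUDA for good. This is not resource-efficient, because app traffic is not perfectly smooth but extremely sparse and spiky.

ZeroGPU takes a just-in-time approach to GPU initialization. Instead of setting up the main process on CUDA, it automatically forks the process, sets the fork up on CUDA, runs the GPU task, and kills the fork when the GPU needs to be released.

This means that:

When the app receives no traffic, it uses no GPU.
When it is actually performing a task, it uses one GPU.
It can use multiple GPUs in parallel as needed.

Thanks to the Python spaces package, the only code change needed to get this behavior is: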

import gradio as gr
+ import spaces
from diffusers import DiffusionPipeline
...

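By importing spaces and adding the @spaces.GPU decorator, we:

Intercept PyTorch API calls to postpone CUDA operations.
Make the decorated function run in a fork when it is later called.
(Call an internal API to make the right device visible to the fork, but this is outside the scope of this blog post.)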
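
To make the idea of postponing device work more concrete, here is a toy, pure-Python sketch. It is not how the spaces package is implemented: there is no fork and no real GPU involved, and FakeModel and gpu are hypothetical names introduced only for illustration. It only shows the pattern of recording a `.to('cuda')` request at import time and applying it for the duration of a decorated call.

```python
import functools


class FakeModel:
    """Hypothetical stand-in for a model; no real GPU is involved."""

    def __init__(self):
        self.device = "cpu"
        self.pending_device = None

    def to(self, device):
        # Record the request instead of initializing CUDA right away.
        self.pending_device = device
        return self


def gpu(model):
    """Toy decorator: apply the pending device move only while the task runs."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            previous = model.device
            if model.pending_device is not None:
                model.device = model.pending_device
            try:
                return fn(*args, **kwargs)
            finally:
                # "Release" the device once the task is done.
                model.device = previous

        return wrapper

    return decorator


model = FakeModel().to("cuda")


@gpu(model)
def generate(prompt):
    return f"{prompt} generated on {model.device}"


device_before = model.device
result = generate("a cat")
device_after = model.device
```

In this toy, the model only reports the "cuda" device while generate is running. The real mechanism achieves a similar effect by running the decorated function in a forked process.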

ZeroGPU currently allocates a MIG slice of an H200 (3g.71gb profile). Additional MIG sizes, including the full slice (7g.141gb profile), will come in late 2025.

Modern ML frameworks like PyTorch and JAX have the concept of compilation that can be used to optimize model latency or inference time. Behind the scenes, compilation applies a series of (often hardware-dependent) optimization steps such as operator fusion, constant folding, etc.

PyTorch (from 2.0 onwards) currently has two major interfaces for compilation:

Just-in-time with torch.compile
Ahead-of-time with torch.export + AOTInductor

torch.compile works great in standard environments: it compiles your model the first time it runs, and reuses the optimized version for subsequent calls.

However, on ZeroGPU, given that the process is freshly spun up for (almost) every GPU task, it means that torch.compile can't efficiently re-use compilation and is thus forced to rely on its filesystem cache to restore compiled models. Depending on the model being compiled, this process takes from a few dozen seconds to a couple of minutes, which is way too much for practical GPU tasks in Spaces.

This is where ahead-of-time (AoT) compilation shines.

With AoT, we can export a compiled model once, save it, and later reload it instantly in any process, which is exactly what we need for ZeroGPU. This helps us reduce framework overhead and also eliminates cold-start timings typically incurred in just-in-time compilation.

But how can we do ahead-of-time compilation on ZeroGPU? Let's dive in.

Let's go back to our ZeroGPU base example and unpack what we need to enable AoT compilation. For the purpose of this demo, we will use the black-forest-labs/FLUX.1-dev model:

import gradio as gr
import spaces
import torch
...

In the discussion below, we only compile the transformer component of pipe since, in these generative models, the transformer (or more generally, the denoiser) is the most computationally heavy component.

Compiling a model ahead-of-time with PyTorch involves multiple steps:

Recall that we're going to compile the model ahead of time. Therefore, we need to derive example inputs for the model. Note that these are the same kinds of inputs we expect to see during the actual runs. To capture those inputs, we will leverage the spaces.aoti_capture helper from the spaces package:

with spaces.aoti_capture(pipe.transformer) as call:
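    pipe(

Next, the model is exported with torch.export.export using the captured call.args and call.kwargs, and the resulting exported_transformer is compiled with PyTorch Inductor:

compiled_transformer = spaces.aoti_compile(exported_transformer)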

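This `compiled_transformer` is now an AoT-compiled binary that is ready to be used for inference.

Now we need to bind the compiled transformer to the original pipeline, i.e., `pipe`.

A naive and almost working approach is to simply patch the pipeline with `pipe.transformer = compiled_transformer`. However, this approach does not work because it deletes important attributes such as `dtype`, `config`, etc. Patching only the `forward` method does not work well either, because the original model parameters are then kept in memory, which leads to OOM errors at runtime.

The `spaces` package provides a utility for this too, `spaces.aoti_apply`: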

spaces.aoti_apply(compiled_transformer, pipe.transformer)


And that's it! It takes care of patching `pipe.transformer.forward` with the compiled model, as well as cleaning the old model parameters out of memory.
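
The difference between replacing the whole module and patching it in place can be illustrated with a toy sketch. This is not the implementation of `spaces.aoti_apply`; `ToyTransformer`, `toy_compiled_forward`, and `toy_apply` are hypothetical names, and plain Python lists stand in for tensors. It only shows why patching `forward` on the existing object keeps attributes such as `dtype` and `config`, and why the old parameters have to be dropped explicitly.

```python
class ToyTransformer:
    """Hypothetical module with attributes the pipeline relies on."""

    def __init__(self):
        self.dtype = "bfloat16"
        self.config = {"num_layers": 2}
        self.weights = [0.5] * 1000  # stands in for the original parameters

    def forward(self, x):
        return [v * self.weights[0] for v in x]


def toy_compiled_forward(x):
    # Stands in for a compiled binary that carries its own copy of the weights.
    return [v * 0.5 for v in x]


def toy_apply(compiled_fn, module):
    """Patch forward on the existing object and drop the old parameters."""
    module.forward = compiled_fn
    module.weights = None  # without this, the old parameters would stay in memory


transformer = ToyTransformer()
toy_apply(toy_compiled_forward, transformer)
output = transformer.forward([2.0, 4.0])
```

After the call, the object still exposes `dtype` and `config`, its `forward` runs the compiled stand-in, and the original weights are no longer referenced.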

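Performing the first three steps (intercepting example inputs, exporting the model, and compiling with PyTorch Inductor) requires a real GPU. The CUDA emulation you get outside of a `@spaces.GPU` function is not enough, because compilation is truly hardware-dependent: for instance, it relies on micro-benchmark runs to tune the generated code. This is why we need to wrap everything inside a `@spaces.GPU` function and then bring the compiled model back to the root of the app. Starting from the original demo code, this gives: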

import gradio as gr
import spaces
import torch
...


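With just a few extra lines of code, we made our demo **1.7×** faster (on FLUX.1-dev).

If you want to learn more about AoT compilation, you can read PyTorch's AOTInductor tutorial.

Now that we have shown the speedups that can be achieved under ZeroGPU's constraints, we will discuss a few gotchas that came up while working with this setup.

AoT can be combined with quantization to deliver even greater speedups. For image and video generation, FP8 post-training dynamic quantization schemes offer a good speed-quality trade-off. However, FP8 requires a CUDA compute capability of at least 9.0. Thankfully, ZeroGPU is based on H200s, so we can already take advantage of FP8 quantization schemes.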
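
As a small illustration of the compute capability requirement, the check can be expressed as a tuple comparison. This is a minimal sketch, not part of the original demo: `supports_fp8` and `FP8_MIN_CAPABILITY` are hypothetical names, and the capability tuples are hard-coded here so the snippet runs without a GPU. With PyTorch installed, the tuple would typically come from `torch.cuda.get_device_capability()`.

```python
# Minimum (major, minor) compute capability stated in the text for FP8.
FP8_MIN_CAPABILITY = (9, 0)


def supports_fp8(capability):
    """Return True if a (major, minor) compute capability meets the 9.0 minimum."""
    return tuple(capability) >= FP8_MIN_CAPABILITY


h200_ok = supports_fp8((9, 0))  # Hopper-class GPUs such as the H200 report 9.0
a100_ok = supports_fp8((8, 0))  # Ampere-class GPUs such as the A100 report 8.0
```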

To enable FP8 quantization within our AoT compilation workflow, we can leverage the APIs provided by `torchao` as follows:

from torchao.quantization import quantize_, Float8DynamicActivationFloat8WeightConfig
# Quantize the transformer just before the export step.
quantize_(pipe.transformer, Float8DynamicActivationFloat8WeightConfig())
...


(See TorchAO for more details.)

We can then proceed with the rest of the steps as outlined above. Using quantization provides another **1.2×** speedup.

Images and videos can come in different shapes and sizes. Hence, it is also important to account for shape dynamism when performing AoT compilation. The primitives provided by `torch.export.export` make it easy to configure which inputs should be treated as having dynamic shapes.

For the Flux.1-Dev transformer, changes in image resolution affect two of its `forward` arguments:

`hidden_states`: The noisy input latents, which the transformer is supposed to denoise. It is a 3D tensor of shape `batch_size, flattened_latent_dim, embed_dim`.

When the batch size is fixed, it's the `flattened_latent_dim` that will change for any changes made to image resolutions.

`img_ids`: A 2D array of encoded pixel coordinates having a shape of `height * width, 3`.

In this case, we want to make `height * width` dynamic.

We start by defining a range in which we want to let the (latent) image resolutions vary. To derive these value ranges, we inspected the shapes of `hidden_states` in the pipeline with respect to varied image resolutions. The exact values are model-dependent and require manual inspection and some intuition. For Flux.1-Dev, we ended up with:

transformer_hidden_dim = torch.export.Dim('hidden', min=4096, max=8212)
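
To get a feel for which resolutions fall inside such a range, the sketch below estimates `flattened_latent_dim` from the pixel size. It rests on an assumption that is not stated in the text: that latents are downsampled 8× by the VAE and then packed into 2×2 patches, so each token covers a 16×16 pixel area. The helper names are hypothetical, and the result should be checked against the shapes actually observed in the pipeline.

```python
# Bounds taken from the Dim definition above.
HIDDEN_MIN, HIDDEN_MAX = 4096, 8212


def flattened_latent_dim(height, width, vae_scale=8, patch=2):
    """Estimate the packed latent sequence length for an image size in pixels.

    Assumes an 8x VAE downsampling followed by 2x2 patch packing (an assumption,
    not something stated in the text).
    """
    step = vae_scale * patch
    return (height // step) * (width // step)


def in_compiled_range(height, width):
    """Check whether a resolution falls inside the dynamic range declared above."""
    return HIDDEN_MIN <= flattened_latent_dim(height, width) <= HIDDEN_MAX


square = flattened_latent_dim(1024, 1024)
portrait = flattened_latent_dim(1536, 1024)
```

Under this assumption, a 1024×1024 image sits exactly at the lower bound, while a 1536×1536 image would exceed the upper bound and fall outside the compiled range.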


We then define a map of argument names and which dimensions in their input values we expect to be dynamic:

transformer_dynamic_shapes = {
    "hidden_states": {1: transformer_hidden_dim},
    "img_ids": {0: transformer_hidden_dim},
...


Then we need to make our dynamic shapes object replicate the structure of our example inputs. The inputs that do not need dynamic shapes must be set to `None`. This can be done very easily with PyTorch's `tree_map` utility:

from torch.utils._pytree import tree_map
dynamic_shapes = tree_map(lambda v: None, call.kwargs)
dynamic_shapes |= transformer_dynamic_shapes
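
The effect of these two lines can be reproduced with plain dictionaries. The sketch below is a stand-in that runs without PyTorch: a dict comprehension replaces `tree_map` (which is enough for a flat dict of keyword arguments), strings stand in for tensors and for the `Dim` object, and the `timestep` and `guidance` keys are hypothetical examples of inputs that stay static.

```python
# Hypothetical captured keyword arguments; strings stand in for tensors.
example_kwargs = {
    "hidden_states": "tensor",
    "img_ids": "tensor",
    "timestep": "tensor",
    "guidance": "tensor",
}

# Only the arguments with a dynamic dimension are listed here.
transformer_dynamic_shapes = {
    "hidden_states": {1: "hidden"},
    "img_ids": {0: "hidden"},
}

# Start with None for every input, then overlay the dynamic entries.
dynamic_shapes = {name: None for name in example_kwargs}
dynamic_shapes |= transformer_dynamic_shapes
```

The merged dictionary has one entry per example input, with `None` marking the inputs whose shapes stay fixed.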


Now, when performing the export step, we simply supply the merged `dynamic_shapes` object to `torch.export.export`:

exported_transformer = torch.export.export(
    pipe.transformer,
    args=call.args,
...


Check out this Space that shows how to use both quantization and dynamic shapes during the export step.

Dynamic shapes are sometimes not enough when the amount of dynamism is too large.

This is, for instance, the case with the Wan family of video generation models if you want your compiled model to generate different resolutions. One thing can be done in this case: compile one model per resolution while keeping the model parameters shared, and dispatch the right one at runtime.

Here is a minimal example of this approach: zerogpu-aoti-multi.py. You can also see a fully working implementation of this paradigm in the Wan 2.2 Space.
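
The dispatch part of this approach can be sketched in a few lines of plain Python. This is only an illustration of the lookup pattern, not the content of zerogpu-aoti-multi.py: `make_toy_compiled` stands in for the real per-resolution compilation, the resolutions listed are hypothetical, and parameter sharing is not modeled.

```python
def make_toy_compiled(height, width):
    """Stand-in for compiling one model variant for a fixed resolution."""

    def run(prompt):
        return f"{prompt} @ {width}x{height}"

    return run


# Hypothetical (height, width) pairs, each with its own compiled variant.
SUPPORTED = [(480, 832), (832, 480)]
compiled_by_resolution = {res: make_toy_compiled(*res) for res in SUPPORTED}


def generate(prompt, height, width):
    """Pick the variant compiled for this exact resolution, or fail clearly."""
    try:
        model = compiled_by_resolution[(height, width)]
    except KeyError:
        raise ValueError(f"no compiled model for {width}x{height}") from None
    return model(prompt)


landscape = generate("a fox", 480, 832)
```

Failing loudly on an unsupported resolution is one possible design choice. An app could instead fall back to the uncompiled model or snap to the nearest supported size.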

Since the ZeroGPU hardware and CUDA drivers are perfectly compatible with Flash-Attention 3 (FA3), we can use it in our ZeroGPU Spaces to speed things up even further. FA3 works with ahead-of-time compilation. So, this is ideal for our case.

Compiling and building FA3 from source can take several minutes, and this process is hardware-dependent. As users, we wouldn't want to lose precious ZeroGPU compute hours. This is where the Hugging Face `kernels` library comes to the rescue. It provides access to pre-built kernels that are compatible with a given hardware setup. For example, when we try to run:

from kernels import get_kernel
vllm_flash_attn3 = get_kernel("kernels-community/vllm-flash-attn3")


It tries to load a kernel from the `kernels-community/vllm-flash-attn3` repository that is compatible with the current setup. If no compatible build is available, it errors out due to incompatibility issues. Luckily for us, this works seamlessly on ZeroGPU Spaces. This means we can leverage the power of FA3 on ZeroGPU, using the `kernels` library.

Here is a fully working example of an FA3 attention processor for the Qwen-Image model.

So far, we have been compiling the full model. Depending on the model, full model compilation can lead to significantly long cold start times. Long cold start times make the development experience unpleasant.
