Blaizzy/mlx-vlm

MLX-VLM 은 MLX 를 사용하여 Mac 에서 Vision Language Models (VLMs) 과 Omni Models (오디오 및 비디오 지원 VLMs) 의 추론과 미세 조정용 패키지입니다.

Installation
Usage
Activation Quantization (CUDA)
Multi-Image Chat Support
Model-Specific Documentation
Vision Feature Caching
TurboQuant KV Cache
Distributed Inference
Fine-tuning

일부 모델은 프롬프트 형식, 예시 및 모범 사례가 포함된 상세 문서를 제공합니다:

Model	Documentation
DeepSeek-OCR	Docs
...

mlx-vlm 패키지를 pip 를 사용하여 설치하는 것이 시작하는 가장 쉬운 방법입니다:

pip install -U mlx-vlm

CLI 를 사용하여 모델에서 출력을 생성합니다:

# Text generation
mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Hello, how are you?"
# Image generation
...

생각 모델 (예: Qwen3.5) 의 경우, thinking block 에서 소비되는 토큰 수를 제한할 수 있습니다:

mlx_vlm.generate --model mlx-community/Qwen3.5-2B-4bit \
--thinking-budget 50 \
--thinking-start-token "" \
...

Flag	Description
`--enable-thinking`
채팅 템플릿에서 thinking mode 를 활성화합니다
`--thinking-budget`
thinking block 내부에 허용되는 최대 토큰 수
`--thinking-start-token`
thinking block 을 열는 토큰 (기본값: `` )
`--thinking-end-token`
thinking block 을 닫는 토큰 (기본값: `<end>`)

예산이 초과되면, 모델은 \n<end> 를 출력하도록 강제로 하며 답변으로 전환합니다. --enable-thinking 이 전달되더라도 모델의 채팅 템플릿이 이를 지원하지 않는 경우, 예산은 모델이 시작 토큰을 자체적으로 생성할 때에만 적용됩니다.

서버에서 thinking mode 는 기본값으로 비활성화됩니다. --enable-thinking 을 사용하여 thinking mode 를 요청의 기본값으로 설정합니다:

mlx_vlm.server --model Qwen/Qwen3.5-4B --enable-thinking

요청에서는 enable_thinking: true 또는 enable_thinking: false 로 서버 기본값을 덮어쓸 수 있습니다.

작은 "drafter" 모델로 여러 후보 토큰을 작성하고 단일 타겟 포워드 패스에서 이를 확인하여 생성 속도를 높입니다. 두 가지 drafter 가 지원됩니다.

Flag	Description
`--draft-model`
drafter 의 HuggingFace repo 또는 로컬 경로
`--draft-kind`
drafter family — `dflash` (기본값) 또는 `mtp` (Gemma 4)
`--draft-block-size`
drafter 의 설정된 block size 를 덮어씁니다

Python API 예시 및 배치 생성을 포함하여 docs/usage.md 를 참조하세요.

멀티 토큰 예측 (Multi-Token Prediction): 목표 모델과 K/V 를 공유하고, 고정된 위치에서 자기회귀적으로 여러 토큰을 생성하는 구글의 4 레이어 'assistant' drafter. --draft-kind mtp를 사용하여 MTP 라운드 루프를 실행합니다.

mlx_vlm.generate --model mlx-community/gemma-4-31B-it-bf16 \
--draft-model mlx-community/gemma-4-31B-it-assistant-bf16 \
--draft-kind mtp --draft-block-size 4 \
...

지원하는 조합 (목표 ↔ drafter):

목표	drafter
`mlx-community/gemma-4-E2B-it-bf16`
`mlx-community/gemma-4-E2B-it-assistant-bf16`
`mlx-community/gemma-4-E4B-it-bf16`
`mlx-community/gemma-4-E4B-it-assistant-bf16`
`mlx-community/gemma-4-26B-A4B-it-bf16`
`mlx-community/gemma-4-26B-A4B-it-assistant-bf16`
`mlx-community/gemma-4-31B-it-bf16`
`mlx-community/gemma-4-31B-it-assistant-bf16`

측정된 속도 향상 (greedy, byte-identical output): 26B-A4B 에서 최대 3.94×, B=4 에서 2.29×. 자세한 sweep 및 아키텍처 노트는 mlx_vlm/speculative/drafters/gemma4_assistant/README.md 참조.

Gradio 를 사용하여 채팅 인터페이스를 시작합니다:

mlx_vlm.chat_ui --model mlx-community/Qwen2-VL-2B-Instruct-4bit

MLX-VLM 을 파이썬 스크립트에서 사용하는 방법의 예제입니다:

import mlx.core as mx
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
...

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
...

서버를 시작합니다:

mlx_vlm.server --port 8080
# 서버 시작 시 모델 미리 로드 (Hugging Face repo 또는 로컬 경로)
mlx_vlm.server --model <hf_repo_or_local_path>
...

--model
: 서버 시작 시 모델 미리 로드, Hugging Face repo ID 또는 로컬 경로를 허용 (선택 사항, 생략 시 첫 요청 시 지연 로드)
--adapter-path
: 미리 로드된 모델과 함께 사용할 어댑터 중량 경로
--draft-model
: 추측 drafter 경로 또는 HF id (예:z-lab/Qwen3.5-4B-DFlash, google/gemma-4-31B-it-assistant) — ~2× 또는 더 높은 throughput 을 위한 speculative decoding 가능
--draft-kind
: drafter 가족 —dflash(기본값) 또는mtp

(Gemma 4)--draft-block-size

: Override the drafter's configured block size--host

: Host address (default:0.0.0.0
)

--port

: Port number (default:8080
)

--trust-remote-code

: Trust remote code when loading models from Hugging Face Hub--enable-thinking

: Enable thinking mode by default for requests that do not setenable_thinking

--kv-bits

: Number of bits for KV cache quantization (e.g.8
for uniform,3.5
for TurboQuant)--kv-quant-scheme

: KV cache quantization backend (uniform
orturboquant
)

--kv-group-size

: Group size for uniform KV cache quantization (default:64
)

--max-kv-size

: Maximum KV cache size in tokens--vision-cache-size

: Max number of cached vision features (default:20
)

--log-level

: Logging level —DEBUG
,INFO
,WARNING
,ERROR
,CRITICAL
(default:INFO
)

You can also set trust remote code via environment variable:

MLX_TRUST_REMOTE_CODE=true mlx_vlm.server

The server provides multiple endpoints for different use cases and supports dynamic model loading/unloading with caching (one model at a time).

The server supports continuous batching for higher throughput when handling multiple concurrent requests. New requests join the active batch immediately without waiting for existing requests to finish, and mixed batches of image and text-only requests are supported.

Continuous batching is enabled automatically when the server loads a model. You can pre-load a model at startup so it's ready to serve immediately:

mlx_vlm.server --port 8080 --model mlx-community/Qwen2.5-VL-3B-Instruct-4bit

Verify via the health endpoint:

curl http://localhost:8080/health
# {"status":"healthy","loaded_model":"...","apc_enabled":false}

If --model
is omitted, the model is loaded on the first request.

Automatic Prefix Caching reuses block-level K/V cache state across requests that share the same prefix. It is useful for repeated long documents, long chat histories, or retrieval contexts where each request appends a short new suffix.

APC has two tiers:

Warm memory: keeps reusableAPCBlock
tensors in process memory. This is the fastest path, but it keeps both the reusable block pool and the runtimeKVCache
.Warm disk: persists cached prefixes as safetensors shards so they survive process restarts. Warm-disk reads build the layer-major prompt cache directly without promoting restored blocks into theAPCBlock
pool; writes can still populate both memory and disk tiers.

APCManager 직접 사용

stream_generate 호출 시 APCManager 를 직접 사용하세요:

from pathlib import Path
from mlx_vlm import load, stream_generate
from mlx_vlm.apc import APCManager, DiskBlockStore
...

모델과 비교하여 차가운 메모리 (cold), 따뜻한 메모리 (warm-memory), 따뜻한 디스크 (warm-disk), 디스크 추방 (disk-eviction) 동작을 확인하려면 동일한 직접 API 경로를 사용하세요:

import os
import tempfile
import time
...

서버에 인메모리 APC 를 활성화하려면 환경 변수를 사용하세요:

APC_ENABLED=1 \
APC_NUM_BLOCKS=4096 \
mlx_vlm.server --model Qwen/Qwen3-VL-4B-Instruct --port 8080

지속 가능한 디스크 계층을 활성화하세요:

APC_ENABLED=1 \
APC_NUM_BLOCKS=4096 \
APC_DISK_PATH=~/.cache/mlx-vlm/caching \
...

동일한 긴 접두사를 가진 반복 요청은 자동으로 APC 를 사용하게 됩니다:

curl -X POST "http://localhost:8080/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "X-APC-Tenant: demo" \
...

동일한 X-APC-Tenant 값을 사용하여 캐시된 접두사를 공유할 수 있는 요청에 사용하세요. 사용자 또는 워크스페이스 간 캐시 항목을 격리하기 위해 다른 tenant 값을 사용하세요.

APC 상태를 확인 및 초기화:

curl http://localhost:8080/v1/cache/stats
curl -X POST http://localhost:8080/v1/cache/reset

일반 APC 환경 변수:

Variable	Default	Description
`APC_ENABLED`	`0`	`1`로 설정하여 APC 활성화
`APC_NUM_BLOCKS`	`2048`	인메모리 APC 블록 수
`APC_BLOCK_SIZE`	`16`	APC 블록당 토큰 수
`APC_DISK_PATH`	unset	지속 가능한 디스크 쉘드용 디렉터리
`APC_DISK_MAX_GB`	`0`	GB 단위의 디스크 제한; `0` 은 제한 없음
`APC_DISK_SHARD_MAX_BLOCKS`	`256`	디스크 세그먼트 쉘드당 최대 블록 수
`APC_MAX_POOL_TENSORS`	`450000`	Metal 리소스 제한 전에 메모리 블록 추가 중지; 디스크 작성 계속
`APC_LAYER_MAJOR_MEMORY_MIN_TOKENS`	`50000`	긴 따뜻한 메모리 접두사를 텐서 대신 컴팩트 레이어-주요 스냅샷으로 저장
`APC_HASH`	`fast`	`sha256`로 설정하여 안정적인 암호학적 해시 사용

APC 는 커스텀 캐시 레이아웃을 사용하는 모델에 대해 자동으로 비활성화됩니다. 서버에서는 KV-cache 양자화 (quantization) 가 활성화되면 APC 도 건너뜀니다.

연속 배치 (continuous batching) 시 --kv-bits 를 사용하여 KV cache 메모리 감소:

# Uniform 8-bit KV cache quantization
mlx_vlm.server --model google/gemma-4-26b-a4b-it --kv-bits 8
# TurboQuant 3.5-bit (3-bit keys + 4-bit values)
...

Full-attention 레이어는 양자화된 배치 캐시를 사용하며, 슬라이딩 윈도우 레이어는 고정 크기 회전 캐시를 유지합니다. 마지막 Full-attention 레이어는 양자화되지 않은 상태로 유지됩니다 (심층 모델에서 민감함).

20K 컨텍스트에서 gemma-4-26b-a4b-it 로 테스트:

Config	Gen tok/s	KV Cache	KV Reduction
No quant	50.3	0.624 GB	1x
...

모든 Full-attention 레이어를 가진 모델 (예: Qwen, LLaMA) 은 더 큰 감소율을 보입니다 — 8-bit 에서 최대 3.6 배, 4-bit 에서 최대 6.4 배.

/chat/completions
엔드포인트는 OpenAI 호환 토큰별 로그 확률을 지원합니다. 요청에 logprobs: true
(선택적으로 top_logprobs: N, 최대 20) 를 전달하세요:

curl -X POST "http://localhost:8080/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
...

각 선택지는 생성된 토큰마다 하나씩의 항목을 가진 logprobs.content[]
목록을 제공합니다: {token, logprob, bytes, top_logprobs: [{token, logprob, bytes}, ...]}.
스트리밍 및 비 스트리밍 모두 지원됩니다.

top_logprobs
은 서버가 토큰당 계산할 대안 개수에 대해 0 이 아닌 상한선으로 시작해야 합니다 (기본값 0
= 비활성화, 최대 20). --top-logprobs-k
플래그 또는 TOP_LOGPROBS_K
환경 변수로 설정하세요:

mlx_vlm.server --model mlx-community/Qwen2-VL-2B-Instruct-4bit --top-logprobs-k 5
# 또는
TOP_LOGPROBS_K=5 mlx_vlm.server --model mlx-community/Qwen2-VL-2B-Instruct-4bit

요청별 top_logprobs 는 TOP_LOGPROBS_K 로 제한됩니다. TOP_LOGPROBS_K=0 일 때, logprobs: true 를 가진 요청은 여전히 선택 토큰의 로그 확률을 반환하지만, top_logprobs 목록은 비어있을 뿐입니다. 상한선을 0 으로 유지하면 어휘 전체에 대한 정렬이 디코딩 그래프에서 제외되므로, 로그 확률이 필요하지 않은 배포는 0 오버헤드를 가집니다.

/v1/chat/completions
및 /v1/responses
엔드포인트는 OpenAI 호환 json_schema 구조화된 출력을 지원합니다. 서버는 공급된 JSON 스키마에 생성을 제한하며 스트리밍 및 비 스트리밍 응답 모두를 지원합니다.

Pydantic 로 스키마를 정의할 수 있습니다:

from typing import Literal
from pydantic import BaseModel, ConfigDict, Field
class AnimalResult(BaseModel):
...

OpenAI Python 클라이언트로 로컬 서버를 호출하세요:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
...

예제 출력:

animal='dog' species='Canis lupus familiaris' description='A domesticated canine known for companionship and loyalty.'

Blaizzy/mlx-vlm

요약

핵심 포인트

댓글