Introducing Cactus, an Ultra-Low-Latency AI Engine for Mobile and Wearable Devices
Summary
Cactus is a low-latency AI inference engine optimised for mobile and wearable devices. It delivers the fastest inference on ARM CPUs and, through zero-copy memory mapping, uses roughly 10x less RAM than comparable engines. A single SDK covers multimodal workloads across speech, vision, and language models, and the engine handles NPU-accelerated prefill and automatic cloud fallback to keep the AI experience reliable. It exposes a C++ API alongside SDKs for Python, Swift, Kotlin, Dart, and Rust.
Key Points
- Delivers the fastest inference on ARM CPUs and cuts RAM usage by 10x through zero-copy memory mapping.
- A single SDK covers multimodal capabilities spanning speech-to-text (STT), vision, and language models.
- Provides NPU-accelerated prefill and OpenAI-compatible APIs, with a hybrid design that automatically routes requests to cloud models when needed.
- Offers a C++ Graph API, a Python SDK, and SDKs in multiple languages (Swift, Kotlin, Dart, Rust) for broad platform compatibility.
A low-latency AI engine for mobile devices & wearables.
Main features:
- Fast: fastest inference on ARM CPU
- Low RAM: zero-copy memory mapping ensures 10x lower RAM use than other engines
- Multimodal: one SDK for speech, vision, and language models
- Cloud fallback: automatically route requests to cloud models if needed
- Energy-efficient: NPU-accelerated prefill
Architecture:
- Cactus Engine: OpenAI-compatible APIs for all major languages (chat, vision, STT, RAG, tool calling, cloud handoff)
- Cactus Graph: zero-copy computation graph ("PyTorch for mobile"); custom models, optimised for RAM & quantisation
- Cactus Kernels: ARM SIMD kernels (Apple, Snapdragon, Exynos, etc.); custom attention, KV-cache quant, chunked prefill
Usage Examples
1. Installation and Basic Run
# Step 1: install the Cactus CLI
brew install cactus-compute/cactus/cactus
# Step 2: start transcribing or chatting
cactus transcribe
# or
cactus run
2. C++ API Example (Chat Completion)
#include "cactus.h"
cactus_model_t model = cactus_init(
    "path/to/weight/folder",                    // model weights directory
    "path to txt or dir of txts for auto-rag",  // optional corpus for auto-RAG
    false
);
const char* messages = R"([
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "My name is Henry Ndubuaku"}
])";
const char* options = R"({
"max_tokens": 50,
"stop_sequences": ["<|im_end|>"]
})";
char response[4096];
int result = cactus_complete(
model, // model handle
messages, // JSON chat messages
response, // response buffer
sizeof(response), // buffer size
options, // generation options
nullptr, // tools JSON
nullptr, // streaming callback
nullptr, // user data
nullptr, // pcm audio buffer
0 // pcm buffer size
);
Example response from Gemma3-270m
{
"success": true, // generation succeeded
"error": null, // error details if failed
"cloud_handoff": false, // true if cloud model used
"response": "Hi there!",
"function_calls": [], // parsed tool calls
"confidence": 0.8193, // model confidence
"time_to_first_token_ms": 45.23,
"total_time_ms": 163.67,
"prefill_tps": 1621.89,
"decode_tps": 168.42,
"ram_usage_mb": 245.67,
"prefill_tokens": 28,
"decode_tokens": 50,
"total_tokens": 78
}
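To make these fields concrete, here is a minimal caller-side sketch of checking the result. It assumes a generic JSON library (nlohmann::json here, which is not part of Cactus) and uses only the field names shown in the example response above; handle_response is a hypothetical helper, not an upstream API.
#include <cstdio>
#include <string>
#include <nlohmann/json.hpp>
// Hypothetical helper: inspect the buffer filled by cactus_complete().
void handle_response(const char* response_buffer) {
    auto json = nlohmann::json::parse(response_buffer);
    if (!json.value("success", false)) {
        const auto& err = json["error"];
        std::printf("generation failed: %s\n",
                    err.is_string() ? err.get<std::string>().c_str() : "unknown");
        return;
    }
    if (json.value("cloud_handoff", false)) {
        std::printf("note: request was handed off to a cloud model\n");
    }
    std::printf("assistant: %s\n", json.value("response", std::string()).c_str());
    std::printf("ttft: %.2f ms, decode: %.1f tok/s, ram: %.1f MB\n",
                json.value("time_to_first_token_ms", 0.0),
                json.value("decode_tps", 0.0),
                json.value("ram_usage_mb", 0.0));
}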
3. C++ API Example (Graph Computation)
#include "cactus.h"
CactusGraph graph;
auto a = graph.input({2, 3}, Precision::FP16);
auto b = graph.input({3, 4}, Precision::INT8);
auto x1 = graph.matmul(a, b, false);
auto x2 = graph.transpose(x1);
auto result = graph.matmul(b, x2, true);
float a_data[6] = {1.1f, 2.3f, 3.4f, 4.2f, 5.7f, 6.8f};
float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
graph.set_input(a, a_data, Precision::FP16);
graph.set_input(b, b_data, Precision::INT8);
graph.execute();
void* output_data = graph.get_output(result);
graph.hard_reset();
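The same handful of Graph API calls can be composed into other small computations. As an illustrative sketch only (reusing just the operations shown above, and assuming matmul's boolean flag behaves as in that example), a plain similarity matrix S = A·Aᵀ might look like this:
#include "cactus.h"
int main() {
    CactusGraph graph;
    auto a  = graph.input({4, 8}, Precision::FP16);
    auto at = graph.transpose(a);          // 8x4
    auto s  = graph.matmul(a, at, false);  // 4x4 similarity matrix
    float a_data[32];
    for (int i = 0; i < 32; ++i) a_data[i] = 0.1f * static_cast<float>(i);  // dummy values
    graph.set_input(a, a_data, Precision::FP16);
    graph.execute();
    void* s_data = graph.get_output(s);    // raw buffer; layout follows the output precision
    (void)s_data;
    graph.hard_reset();
    return 0;
}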
Reference APIs and SDKs
| Reference | Language | Description |
|---|---|---|
| Engine API | C | Chat completion, streaming, tool calling, transcription, embeddings, RAG, vision, VAD, vector index, cloud handoff |
| Graph API | C++ | Tensor operations, matrix multiplication, attention, normalization, activation functions |
| Python SDK | Python | Mac, Linux |
| Swift SDK | Swift | iOS, macOS, tvOS, watchOS, Android |
| Kotlin SDK | Kotlin | Android, iOS (via KMP) |
| Flutter SDK | Dart | iOS, macOS, Android |
| Rust SDK | Rust | Mac, Linux |
| React Native | JavaScript | iOS, Android |
Model Weights and Performance Benchmarks
Model weights are pre-converted for all supported models at huggingface.co/Cactus-Compute.
- All weights are INT4 quantised
- LFM: 1k-token prefill / 100-token decode; values are prefill tps / decode tps
- LFM-VL: 256px image input; values are latency / decode tps
- Parakeet: 20 s audio input; values are latency / decode tps
- Missing latency = no NPU support yet
Device Performance (LFM)
| Device | LFM 1.2B | LFM-VL 1.6B | Parakeet 1.1B | RAM |
|---|---|---|---|---|
| Mac M4 Pro | 582/100 | 0.2s/98 | 0.1s/900k+ | 76MB |
| iPad/Mac M3 | 350/60 | 0.3s/69 | 0.3s/800k+ | 70MB |
| iPhone 17 Pro | 327/48 | 0.3s/48 | 0.3s/300k+ | 108MB |
| iPhone 13 Mini | 148/34 | 0.3s/35 | 0.7s/90k+ | 1GB |
| Galaxy S25 Ultra | 255/37 | -/34 | -/250k+ | 1.5GB |
| Pixel 6a | 70/15 | -/15 | -/17k+ | 1GB |
| Galaxy A17 5G | 32/10 | -/11 | -/40k+ | 727MB |
| CMF Phone 2 Pro | - | - | - | - |
| Raspberry Pi 5 | 69/11 | 13.3s/11 | 4.5s/180k+ | 869MB |
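Reading the LFM column as prefill tps / decode tps (per the notes above): on the Mac M4 Pro, 582/100 means the 1k-token prefill completes in roughly 1000 / 582 ≈ 1.7 s, after which decoding proceeds at about 100 tokens per second, so the 100-token decode adds roughly another second.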
STT Performance (Whisper & Parakeet)
- STT benchmark: 20 s audio input on a MacBook Air (M3)
- Benchmark dataset: internal evals with production users
| Model | Params | End-to-end (ms) | Latency (ms) | Decode (tok/s) | NPU | RTF | WER |
|---|---|---|---|---|---|---|---|
| UsefulSensors/moonshine-base | 61M | 361.35 | 182 | 262 | yes | 0.0180 | 0.1395 |
| openai/whisper-tiny | 39M | 232.03 | 137.38 | 581 | yes | 0.0116 | 0.1860 |
| openai/whisper-base | 74M | 329.37 | 178.65 | 358 | yes | 0.0164 | 0.1628 |
| openai/whisper-small | 244M | 856.79 | 332.63 | 108 | yes | 0.0428 | 0.0930 |
| openai/whisper-medium | 769M | 2085.87 | 923.33 | 49 | yes | 0.1041 | 0.0930 |
| openai/whisper-large-v3 | 1.55B | 5994 | 2050 | 15.72 | no | 0.2992 | - |
| nvidia/parakeet-ctc-0.6b | 600M | 201.77 | 201.44 | 5214285 | yes | 0.0101 | 0.0930 |
| nvidia/parakeet-tdt-0.6b-v3 | 600M | 718.91 | 718.82 | 3583333 | yes | 0.0359 | 0.0465 |
| nvidia/parakeet-ctc-1.1b | 1.1B | 279.03 | 278.92 | 4562500 | yes | 0.0139 | 0.1628 |
| snakers4/silero-vad | - | - | - | - | - | - | - |
| pyannote/segmentation-3.0 | - | - | - | - | - | - | - |
| pyannote/wespeaker-voxceleb-resnet34-LM | - | - | - | - | - | - | - |
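Assuming the conventional definition of RTF (processing time divided by audio duration), the column is consistent with the 20 s test clips: whisper-tiny at 232.03 ms end-to-end gives 0.232 / 20 ≈ 0.0116, and whisper-medium at 2085.87 ms gives 2.086 / 20 ≈ 0.104, matching the reported values; anything well below 1.0 runs faster than real time.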
Available Models (HuggingFace)
Gemma weights: Gemma weights are often gated on HuggingFace and require an access token. Run `huggingface-cli login` and enter your HuggingFace token.
| Model | Features |
|---|---|
| google/gemma-3-270m-it | completion |
| google/functiongemma-270m-it | tools |
| google/gemma-3-1b-it | completion, gated |
| google/gemma-4-E2B-it | completion, tools, embed, vision, speech |
| google/gemma-3n-E2B-it | completion, tools |
| google/gemma-4-E4B-it | completion, tools, embed, vision, speech |
| google/gemma-3n-E4B-it | completion, tools |
| google/gemma-4-E2B-it | vision, audio, completion, tools, Apple NPU |
| google/gemma-4-E4B-it | vision, audio, completion, tools, Apple NPU |
| Qwen/Qwen3-0.6B | completion, tools, embed |
| Qwen/Qwen3-Embedding-0.6B | embed |
| Qwen/Qwen3.5-0.8B | vision, completion, tools, embed |
| Qwen/Qwen3-1.7B | completion, tools, embed |
| Qwen/Qwen3.5-2B | vision, completion, tools, embed |
| LiquidAI/LFM2.5-350M | completion, tools, embed |
| LiquidAI/LFM2-700M | completion, tools, embed |
| LiquidAI/LFM2-8B-A1B | completion, tools, embed |
| LiquidAI/LFM2.5-1.2B-Thinking | completion, tools, embed |
| LiquidAI/LFM2.5-1.2B-Instruct | completion, tools, embed |
| LiquidAI/LFM2-2.6B | completion, tools, embed |
| LiquidAI/LFM2-VL-450M | vision, txt & img embed, Apple NPU |
| LiquidAI/LFM2.5-VL-450M | vision, txt & img embed, Apple NPU |
| LiquidAI/LFM2.5-VL-1.6B | vision, txt & img embed, Apple NPU |
| tencent/Youtu-LLM-2B | completion, tools, embed |
| nomic-ai/nomic-embed-text-v2-moe | embed |
Development Roadmap
| Date | Status | Milestone |
|---|---|---|
| Sep 2025 | Done | Released v1 |
| Oct 2025 | Done | Chunked prefill, KVCache Quant (2x prefill) |
| Nov 2025 | Done | Cactus Attention (10 & 1k prefill = same decode) |
| Dec 2025 | Done | Team grows to +6 Research Engineers |
| Jan 2026 | Done | Apple NPU/RAM, 5-11x faster iOS/Mac |
| Feb 2026 | Done | Hybrid inference, INT4, lossless Quant (1.5x) |
| Mar 2026 | Coming | Qualcomm/Google NPUs, 5-11x faster Android |
| Apr 2026 | Coming | Mediatek/Exynos NPUs, Cactus@ICLR |
| May 2026 | Coming | Kernel→C++, Graph/Engine→Rust, Mac GPU & VR |
| Jun 2026 | Coming | Torch/JAX model transpilers |
| Jul 2026 | Coming | Wearables optimisations, Cactus@ICML |
| Aug 2026 | Coming | Orchestration |
| Sep 2026 | Coming | Full Cactus paper, chip manufacturer partners |
Quick Start Guide (Linux)
Step 0: Prerequisites (Ubuntu/Debian)
sudo apt-get install python3 python3-venv python3-pip cmake build-essential libcurl4-openssl-dev
Step 1: Clone and Setup
git clone https://github.com/cactus-compute/cactus && cd cactus
source ./setup
Step 2: Usage Commands
Authentication:
  cactus auth                        manage cloud API key
    --status                         show key status
    --clear                          remove saved key

Run Models (Playground):
  cactus run <model>                 opens playground (auto downloads)
    --precision INT4|INT8|FP16       quantization (default: INT4)
    --token <token>                  HF token (gated models)
    --reconvert                      force reconversion from source

Transcribe Audio:
  cactus transcribe [model]          live mic transcription (parakeet-tdt-0.6b-v3)
    --file <audio.wav>               transcribe a file instead of the mic
    --precision INT4|INT8|FP16       quantization (default: INT4)
    --token <token>                  HF token (gated models)
    --reconvert                      force reconversion from source

Download/Convert Models:
  cactus download <model>            downloads model to ./weights
    --precision INT4|INT8|FP16       quantization (default: INT4)
    --token <token>                  HuggingFace API token
    --reconvert                      force reconversion from source
  cactus convert <model> [dir]       converts a model, supports LoRA merge
    --precision INT4|INT8|FP16       quantization (default: INT4)
    --lora <path>                    LoRA adapter to merge
    --token <token>                  HuggingFace API token

Build and Test:
  cactus build                       build for ARM → build/libcactus.a
    --apple                          Apple (iOS/macOS)
    --android                        Android
    --flutter                        Flutter (all platforms)
    --python                         shared lib for Python FFI
  cactus test                        run unit tests and benchmarks
    --model <model>                  default: LFM2-VL-450M
    --transcribe_model <model>       default: moonshine-base
    --benchmark                      use larger models
    --precision INT4|INT8|FP16       regenerate weights with the given precision
    --reconvert                      force reconversion from source
    --no-rebuild                     skip building the library
    --llm / --stt / --performance    run a specific test suite
    --ios                            run on a connected iPhone
    --android                        run on a connected Android device

Cleanup:
  cactus clean                       remove all build artifacts
  cactus --help                      show all commands and flags