
© 2026 Molayo

HN Key Summaries · 2026-04-24 12:57

Cactus: An Ultra-Low-Latency AI Engine for Mobile and Wearable Devices

Summary

Cactus is a low-latency AI inference engine optimised for mobile and wearable devices. It delivers the fastest inference on ARM CPUs and, through zero-copy memory mapping, uses 10x less RAM than comparable engines. A single SDK covers multimodal workloads spanning speech, vision, and language models, and the engine provides NPU-accelerated prefill along with automatic cloud fallback for a dependable AI experience. It exposes C++ APIs alongside multi-language SDKs (Python, Swift, Kotlin, Dart, Rust).

Key Points

  • Delivers the fastest inference on ARM CPUs and cuts RAM usage 10x via zero-copy memory mapping.
  • A single SDK covers speech-to-text (STT), vision, and language models in one multimodal package.
  • Provides NPU-accelerated prefill and OpenAI-compatible APIs, in a hybrid design that automatically routes requests to the cloud when needed.
  • Offers reference APIs such as the C++ Graph API and a Python SDK, plus multi-language SDKs (Swift, Kotlin, Dart, Rust) for broad platform coverage.

A low-latency AI engine for mobile devices & wearables.

Main features:

  • Fast: fastest inference on ARM CPU
  • Low RAM: zero-copy memory mapping ensures 10x lower RAM use than other engines
  • Multimodal: one SDK for speech, vision, and language models
  • Cloud fallback: automatically route requests to cloud models if needed
  • Energy-efficient: NPU-accelerated prefill

Cactus Engine

  • ←── OpenAI-compatible APIs for all major languages
    Chat, vision, STT, RAG, tool call, cloud handoff

Cactus Graph

  • ←── Zero-copy computation graph (PyTorch for mobile)
    Custom models, optimised for RAM & quantisation

Cactus Kernels

  • ←── ARM SIMD kernels (Apple, Snapdragon, Exynos, etc)
    Custom attention, KV-cache quant, chunked prefill

Usage Examples

1. Installation and Basic Run

# Step 1:
brew install cactus-compute/cactus/cactus
# Step 2:
cactus transcribe
# or:
cactus run

2. C++ API Example (Chat Completion)

#include "cactus.h"

cactus_model_t model = cactus_init(
    "path/to/weight/folder",
    "path to txt or dir of txts for auto-rag",
    false
);

const char* messages = R"([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My name is Henry Ndubuaku"}
])";

const char* options = R"({
    "max_tokens": 50,
    "stop_sequences": ["<|im_end|>"]
})";

char response[4096];
int result = cactus_complete(
    model,             // model handle
    messages,          // JSON chat messages
    response,          // response buffer
    sizeof(response),  // buffer size
    options,           // generation options
    nullptr,           // tools JSON
    nullptr,           // streaming callback
    nullptr,           // user data
    nullptr,           // PCM audio buffer
    0                  // PCM buffer size
);

Example response from Gemma3-270m

{
  "success": true,                 // generation succeeded
  "error": null,                   // error details if failed
  "cloud_handoff": false,          // true if cloud model used
  "response": "Hi there!",
  "function_calls": [],            // parsed tool calls
  "confidence": 0.8193,            // model confidence
  "time_to_first_token_ms": 45.23,
  "total_time_ms": 163.67,
  "prefill_tps": 1621.89,
  "decode_tps": 168.42,
  "ram_usage_mb": 245.67,
  "prefill_tokens": 28,
  "decode_tokens": 50,
  "total_tokens": 78
}

3. C++ API Example (Graph Computation)

#include "cactus.h"
CactusGraph graph;
auto a = graph.input({2, 3}, Precision::FP16);
auto b = graph.input({3, 4}, Precision::INT8);
auto x1 = graph.matmul(a, b, false);
auto x2 = graph.transpose(x1);
auto result = graph.matmul(b, x2, true);
float a_data[6] = {1.1f, 2.3f, 3.4f, 4.2f, 5.7f, 6.8f};
float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
graph.set_input(a, a_data, Precision::FP16);
graph.set_input(b, b_data, Precision::INT8);
graph.execute();
void* output_data = graph.get_output(result);
graph.hard_reset();

Reference APIs and SDKs

| Reference | Language | Description |
|---|---|---|
| Engine API | C | Chat completion, streaming, tool calling, transcription, embeddings, RAG, vision, VAD, vector index, cloud handoff |
| Graph API | C++ | Tensor operations, matrix multiplication, attention, normalization, activation functions |
| Python SDK | Python | Mac, Linux |
| Swift SDK | Swift | iOS, macOS, tvOS, watchOS, Android |
| Kotlin SDK | Kotlin | Android, iOS (via KMP) |
| Flutter SDK | Dart | iOS, macOS, Android |
| Rust SDK | Rust | Mac, Linux |
| React Native | JavaScript | iOS, Android |

Model Weights and Performance Benchmarks

Model weights are pre-converted for all supported models at huggingface.co/Cactus-Compute.

  • All weights INT4 quantised
  • LFM: 1k-prefill / 100-decode, values are prefill tps / decode tps
  • LFM-VL: 256px input, values are latency / decode tps
  • Parakeet: 20s audio input, values are latency / decode tps
  • Missing latency = no NPU support yet

Device Performance (LFM)

| Device | LFM 1.2B | LFM-VL 1.6B | Parakeet 1.1B | RAM |
|---|---|---|---|---|
| Mac M4 Pro | 582/100 | 0.2s/98 | 0.1s/900k+ | 76MB |
| iPad/Mac M3 | 350/60 | 0.3s/69 | 0.3s/800k+ | 70MB |
| iPhone 17 Pro | 327/48 | 0.3s/48 | 0.3s/300k+ | 108MB |
| iPhone 13 Mini | 148/34 | 0.3s/35 | 0.7s/90k+ | 1GB |
| Galaxy S25 Ultra | 255/37 | -/34 | -/250k+ | 1.5GB |
| Pixel 6a | 70/15 | -/15 | -/17k+ | 1GB |
| Galaxy A17 5G | 32/10 | -/11 | -/40k+ | 727MB |
| CMF Phone 2 Pro | - | - | - | - |
| Raspberry Pi 5 | 69/11 | 13.3s/11 | 4.5s/180k+ | 869MB |
  • STT: 20s audio input on Macbook Air M3 chip
  • Benchmark dataset: internal evals with production users

STT Performance (Whisper & Parakeet)

| Model | Params | End2End ms | Latency ms | Decode toks/sec | NPU | RTF | WER |
|---|---|---|---|---|---|---|---|
| UsefulSensors/moonshine-base | 61M | 361.35 | 182 | 262 | yes | 0.0180 | 0.1395 |
| openai/whisper-tiny | 39M | 232.03 | 137.38 | 581 | yes | 0.0116 | 0.1860 |
| openai/whisper-base | 74M | 329.37 | 178.65 | 358 | yes | 0.0164 | 0.1628 |
| openai/whisper-small | 244M | 856.79 | 332.63 | 108 | yes | 0.0428 | 0.0930 |
| openai/whisper-medium | 769M | 2085.87 | 923.33 | 49 | yes | 0.1041 | 0.0930 |
| openai/whisper-large-v3 | 1.55B | 5994 | 2050 | 15.72 | no | 0.2992 | - |
| nvidia/parakeet-ctc-0.6b | 600M | 201.77 | 201.44 | 5214285 | yes | 0.0101 | 0.0930 |
| nvidia/parakeet-tdt-0.6b-v3 | 600M | 718.91 | 718.82 | 3583333 | yes | 0.0359 | 0.0465 |
| nvidia/parakeet-ctc-1.1b | 1.1B | 279.03 | 278.92 | 4562500 | yes | 0.0139 | 0.1628 |
| snakers4/silero-vad | - | - | - | - | - | - | - |
| pyannote/segmentation-3.0 | - | - | - | - | - | - | - |
| pyannote/wespeaker-voxceleb-resnet34-LM | - | - | - | - | - | - | - |

Available Models (HuggingFace)

Gemma Weights: Gemma weights are often gated on HuggingFace and require a token. Run huggingface-cli login and enter your HuggingFace token.

| Model | Features |
|---|---|
| google/gemma-3-270m-it | completion |
| google/functiongemma-270m-it | tools |
| google/gemma-3-1b-it | completion, gated |
| google/gemma-4-E2B-it | completion, tools, embed, vision, speech |
| google/gemma-3n-E2B-it | completion, tools |
| google/gemma-4-E4B-it | completion, tools, embed, vision, speech |
| google/gemma-3n-E4B-it | completion, tools |
| google/gemma-4-E2B-it | vision, audio, completion, tools, Apple NPU |
| google/gemma-4-E4B-it | vision, audio, completion, tools, Apple NPU |
| Qwen/Qwen3-0.6B | completion, tools, embed |
| Qwen/Qwen3-Embedding-0.6B | embed |
| Qwen/Qwen3.5-0.8B | vision, completion, tools, embed |
| Qwen/Qwen3-1.7B | completion, tools, embed |
| Qwen/Qwen3.5-2B | vision, completion, tools, embed |
| LiquidAI/LFM2.5-350M | completion, tools, embed |
| LiquidAI/LFM2-700M | completion, tools, embed |
| LiquidAI/LFM2-8B-A1B | completion, tools, embed |
| LiquidAI/LFM2.5-1.2B-Thinking | completion, tools, embed |
| LiquidAI/LFM2.5-1.2B-Instruct | completion, tools, embed |
| LiquidAI/LFM2-2.6B | completion, tools, embed |
| LiquidAI/LFM2-VL-450M | vision, txt & img embed, Apple NPU |
| LiquidAI/LFM2.5-VL-450M | vision, txt & img embed, Apple NPU |
| LiquidAI/LFM2.5-VL-1.6B | vision, txt & img embed, Apple NPU |
| tencent/Youtu-LLM-2B | completion, tools, embed |
| nomic-ai/nomic-embed-text-v2-moe | embed |

Development Roadmap

| Date | Status | Milestone |
|---|---|---|
| Sep 2025 | Done | Released v1 |
| Oct 2025 | Done | Chunked prefill, KVCache Quant (2x prefill) |
| Nov 2025 | Done | Cactus Attention (10 & 1k prefill = same decode) |
| Dec 2025 | Done | Team grows to +6 Research Engineers |
| Jan 2026 | Done | Apple NPU/RAM, 5-11x faster iOS/Mac |
| Feb 2026 | Done | Hybrid inference, INT4, lossless Quant (1.5x) |
| Mar 2026 | Coming | Qualcomm/Google NPUs, 5-11x faster Android |
| Apr 2026 | Coming | Mediatek/Exynos NPUs, Cactus@ICLR |
| May 2026 | Coming | Kernel→C++, Graph/Engine→Rust, Mac GPU & VR |
| Jun 2026 | Coming | Torch/JAX model transpilers |
| Jul 2026 | Coming | Wearables optimisations, Cactus@ICML |
| Aug 2026 | Coming | Orchestration |
| Sep 2026 | Coming | Full Cactus paper, chip manufacturer partners |

Quick Start Guide (Linux)

Step 0: Prerequisites (Ubuntu/Debian)

sudo apt-get install python3 python3-venv python3-pip cmake build-essential libcurl4-openssl-dev

Step 1: Clone and Setup

git clone https://github.com/cactus-compute/cactus && cd cactus
source ./setup

Step 2: Usage Commands

Authentication:

cactus auth                      manage cloud API key
  --status                       show key status
  --clear                        remove saved key

Run Models (Playground):

cactus run <model>               open the playground (auto-downloads the model)
  --precision INT4|INT8|FP16     quantization (default: INT4)
  --token <token>                HF token (gated models)
  --reconvert                    force reconversion from source

Transcribe Audio:

cactus transcribe [model]        live mic transcription (default: parakeet-tdt-0.6b-v3)
  --file <audio.wav>             transcribe a file instead of the mic
  --precision INT4|INT8|FP16     quantization (default: INT4)
  --token <token>                HF token (gated models)
  --reconvert                    force reconversion from source

Download/Convert Models:

cactus download <model>          download a model to ./weights
  --precision INT4|INT8|FP16     quantization (default: INT4)
  --token <token>                HuggingFace API token
  --reconvert                    force reconversion from source
cactus convert <model> [dir]     convert a model, supports LoRA merge
  --precision INT4|INT8|FP16     quantization (default: INT4)
  --lora <path>                  LoRA adapter to merge
  --token <token>                HuggingFace API token

Build and Test:

cactus build                     build for ARM → build/libcactus.a
  --apple                        Apple (iOS/macOS)
  --android                      Android
  --flutter                      Flutter (all platforms)
  --python                       shared lib for Python FFI

cactus test                      run unit tests and benchmarks
  --model <model>                default: LFM2-VL-450M
  --transcribe_model <model>     default: moonshine-base
  --benchmark                    use larger models
  --precision INT4|INT8|FP16     regenerate weights with this precision
  --reconvert                    force reconversion from source
  --no-rebuild                   skip building the library
  --llm / --stt / --performance  run a specific test suite
  --ios                          run on a connected iPhone
  --android                      run on a connected Android device

Cleanup:

cactus clean                     remove all build artifacts
cactus --help                    show all commands and flags

AI-Generated Content

This content was automatically summarised, translated, and analysed by AI from the original HN AI Engineering post. Copyright belongs to the original author; please refer to the original for the authoritative version.
