
© 2026 Molayo

HN Key Summaries · 2026-04-24 12:57

Cactus: An Ultra-Low-Latency AI Engine for Mobile and Wearable Devices

Summary

Cactus is a low-latency AI inference engine optimised for mobile and wearable devices. It delivers the fastest inference on ARM CPUs and, through zero-copy memory mapping, uses 10x less RAM than comparable engines. A single SDK covers multimodal workloads spanning speech, vision, and language models, and the engine provides NPU-accelerated prefill along with automatic cloud fallback for a dependable AI experience. It exposes C++ APIs alongside multi-language SDKs (Python, Swift, Kotlin, Dart, Rust).

Key Points

  • Delivers the fastest inference on ARM CPUs and cuts RAM usage 10x via zero-copy memory mapping.
  • A single SDK covers speech-to-text (STT), vision, and language models in one multimodal package.
  • Provides NPU-accelerated prefill and OpenAI-compatible APIs, in a hybrid design that automatically routes requests to the cloud when needed.
  • Offers reference APIs such as the C++ Graph API and a Python SDK, plus multi-language SDKs (Swift, Kotlin, Dart, Rust) for broad platform coverage.

A low-latency AI engine for mobile devices & wearables.

Main features:

  • Fast: fastest inference on ARM CPU
  • Low RAM: zero-copy memory mapping ensures 10x lower RAM use than other engines
  • Multimodal: one SDK for speech, vision, and language models
  • Cloud fallback: automatically route requests to cloud models if needed
  • Energy-efficient: NPU-accelerated prefill

Cactus Engine

  • ←── OpenAI-compatible APIs for all major languages
    Chat, vision, STT, RAG, tool call, cloud handoff

Cactus Graph

  • ←── Zero-copy computation graph (PyTorch for mobile)
    Custom models, optimised for RAM & quantisation

Cactus Kernels

  • ←── ARM SIMD kernels (Apple, Snapdragon, Exynos, etc)
    Custom attention, KV-cache quant, chunked prefill

Usage Examples

1. Installation and Basic Run

# Step 1:
brew install cactus-compute/cactus/cactus
# Step 2:
cactus transcribe
# or:
cactus run

2. C++ API Example (Chat Completion)

#include "cactus.h"

cactus_model_t model = cactus_init(
    "path/to/weight/folder",
    "path to txt or dir of txts for auto-rag",
    false
);

const char* messages = R"([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My name is Henry Ndubuaku"}
])";

const char* options = R"({
    "max_tokens": 50,
    "stop_sequences": ["<|im_end|>"]
})";

char response[4096];
int result = cactus_complete(
    model,             // model handle
    messages,          // JSON chat messages
    response,          // response buffer
    sizeof(response),  // buffer size
    options,           // generation options
    nullptr,           // tools JSON
    nullptr,           // streaming callback
    nullptr,           // user data
    nullptr,           // PCM audio buffer
    0                  // PCM buffer size
);

Example response from Gemma3-270m

{
  "success": true,                 // generation succeeded
  "error": null,                   // error details if failed
  "cloud_handoff": false,          // true if cloud model used
  "response": "Hi there!",
  "function_calls": [],            // parsed tool calls
  "confidence": 0.8193,            // model confidence
  "time_to_first_token_ms": 45.23,
  "total_time_ms": 163.67,
  "prefill_tps": 1621.89,
  "decode_tps": 168.42,
  "ram_usage_mb": 245.67,
  "prefill_tokens": 28,
  "decode_tokens": 50,
  "total_tokens": 78
}

3. C++ API Example (Graph Computation)

#include "cactus.h"
CactusGraph graph;
auto a = graph.input({2, 3}, Precision::FP16);
auto b = graph.input({3, 4}, Precision::INT8);
auto x1 = graph.matmul(a, b, false);
auto x2 = graph.transpose(x1);
auto result = graph.matmul(b, x2, true);
float a_data[6] = {1.1f, 2.3f, 3.4f, 4.2f, 5.7f, 6.8f};
float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
graph.set_input(a, a_data, Precision::FP16);
graph.set_input(b, b_data, Precision::INT8);
graph.execute();
void* output_data = graph.get_output(result);
graph.hard_reset();

Reference APIs and SDKs

| Reference | Language | Description |
|---|---|---|
| Engine API | C | Chat completion, streaming, tool calling, transcription, embeddings, RAG, vision, VAD, vector index, cloud handoff |
| Graph API | C++ | Tensor operations, matrix multiplication, attention, normalization, activation functions |
| Python SDK | Python | Mac, Linux |
| Swift SDK | Swift | iOS, macOS, tvOS, watchOS, Android |
| Kotlin SDK | Kotlin | Android, iOS (via KMP) |
| Flutter SDK | Dart | iOS, macOS, Android |
| Rust SDK | Rust | Mac, Linux |
| React Native | JavaScript | iOS, Android |

Model Weights and Performance Benchmarks

Model weights are pre-converted for all supported models at huggingface.co/Cactus-Compute.

  • All weights INT4 quantised
  • LFM: 1k-prefill / 100-decode, values are prefill tps / decode tps
  • LFM-VL: 256px input, values are latency / decode tps
  • Parakeet: 20s audio input, values are latency / decode tps
  • Missing latency = no NPU support yet

Device Performance (LFM)

| Device | LFM 1.2B | LFM-VL 1.6B | Parakeet 1.1B | RAM |
|---|---|---|---|---|
| Mac M4 Pro | 582/100 | 0.2s/98 | 0.1s/900k+ | 76MB |
| iPad/Mac M3 | 350/60 | 0.3s/69 | 0.3s/800k+ | 70MB |
| iPhone 17 Pro | 327/48 | 0.3s/48 | 0.3s/300k+ | 108MB |
| iPhone 13 Mini | 148/34 | 0.3s/35 | 0.7s/90k+ | 1GB |
| Galaxy S25 Ultra | 255/37 | -/34 | -/250k+ | 1.5GB |
| Pixel 6a | 70/15 | -/15 | -/17k+ | 1GB |
| Galaxy A17 5G | 32/10 | -/11 | -/40k+ | 727MB |
| CMF Phone 2 Pro | - | - | - | - |
| Raspberry Pi 5 | 69/11 | 13.3s/11 | 4.5s/180k+ | 869MB |
  • STT: 20s audio input on Macbook Air M3 chip
  • Benchmark dataset: internal evals with production users

STT Performance (Whisper & Parakeet)

| Model | Params | End2End ms | Latency ms | Decode toks/sec | NPU | RTF | WER |
|---|---|---|---|---|---|---|---|
| UsefulSensors/moonshine-base | 61M | 361.35 | 182 | 262 | yes | 0.0180 | 0.1395 |
| openai/whisper-tiny | 39M | 232.03 | 137.38 | 581 | yes | 0.0116 | 0.1860 |
| openai/whisper-base | 74M | 329.37 | 178.65 | 358 | yes | 0.0164 | 0.1628 |
| openai/whisper-small | 244M | 856.79 | 332.63 | 108 | yes | 0.0428 | 0.0930 |
| openai/whisper-medium | 769M | 2085.87 | 923.33 | 49 | yes | 0.1041 | 0.0930 |
| openai/whisper-large-v3 | 1.55B | 5994 | 2050 | 15.72 | no | 0.2992 | - |
| nvidia/parakeet-ctc-0.6b | 600M | 201.77 | 201.44 | 5214285 | yes | 0.0101 | 0.0930 |
| nvidia/parakeet-tdt-0.6b-v3 | 600M | 718.91 | 718.82 | 3583333 | yes | 0.0359 | 0.0465 |
| nvidia/parakeet-ctc-1.1b | 1.1B | 279.03 | 278.92 | 4562500 | yes | 0.0139 | 0.1628 |
| snakers4/silero-vad | - | - | - | - | - | - | - |
| pyannote/segmentation-3.0 | - | - | - | - | - | - | - |
| pyannote/wespeaker-voxceleb-resnet34-LM | - | - | - | - | - | - | - |

Available Models (HuggingFace)

Gemma Weights: Gemma weights are often gated on HuggingFace and require a token. Run huggingface-cli login and enter your HuggingFace token.

| Model | Features |
|---|---|
| google/gemma-3-270m-it | completion |
| google/functiongemma-270m-it | tools |
| google/gemma-3-1b-it | completion, gated |
| google/gemma-4-E2B-it | completion, tools, embed, vision, speech |
| google/gemma-3n-E2B-it | completion, tools |
| google/gemma-4-E4B-it | completion, tools, embed, vision, speech |
| google/gemma-3n-E4B-it | completion, tools |
| google/gemma-4-E2B-it | vision, audio, completion, tools, Apple NPU |
| google/gemma-4-E4B-it | vision, audio, completion, tools, Apple NPU |
| Qwen/Qwen3-0.6B | completion, tools, embed |
| Qwen/Qwen3-Embedding-0.6B | embed |
| Qwen/Qwen3.5-0.8B | vision, completion, tools, embed |
| Qwen/Qwen3-1.7B | completion, tools, embed |
| Qwen/Qwen3.5-2B | vision, completion, tools, embed |
| LiquidAI/LFM2.5-350M | completion, tools, embed |
| LiquidAI/LFM2-700M | completion, tools, embed |
| LiquidAI/LFM2-8B-A1B | completion, tools, embed |
| LiquidAI/LFM2.5-1.2B-Thinking | completion, tools, embed |
| LiquidAI/LFM2.5-1.2B-Instruct | completion, tools, embed |
| LiquidAI/LFM2-2.6B | completion, tools, embed |
| LiquidAI/LFM2-VL-450M | vision, txt & img embed, Apple NPU |
| LiquidAI/LFM2.5-VL-450M | vision, txt & img embed, Apple NPU |
| LiquidAI/LFM2.5-VL-1.6B | vision, txt & img embed, Apple NPU |
| tencent/Youtu-LLM-2B | completion, tools, embed |
| nomic-ai/nomic-embed-text-v2-moe | embed |

Development Roadmap

| Date | Status | Milestone |
|---|---|---|
| Sep 2025 | Done | Released v1 |
| Oct 2025 | Done | Chunked prefill, KVCache Quant (2x prefill) |
| Nov 2025 | Done | Cactus Attention (10 & 1k prefill = same decode) |
| Dec 2025 | Done | Team grows to +6 Research Engineers |
| Jan 2026 | Done | Apple NPU/RAM, 5-11x faster iOS/Mac |
| Feb 2026 | Done | Hybrid inference, INT4, lossless Quant (1.5x) |
| Mar 2026 | Coming | Qualcomm/Google NPUs, 5-11x faster Android |
| Apr 2026 | Coming | Mediatek/Exynos NPUs, Cactus@ICLR |
| May 2026 | Coming | Kernel→C++, Graph/Engine→Rust, Mac GPU & VR |
| Jun 2026 | Coming | Torch/JAX model transpilers |
| Jul 2026 | Coming | Wearables optimisations, Cactus@ICML |
| Aug 2026 | Coming | Orchestration |
| Sep 2026 | Coming | Full Cactus paper, chip manufacturer partners |

Quick Start Guide (Linux)

Step 0: Prerequisites (Ubuntu/Debian)

sudo apt-get install python3 python3-venv python3-pip cmake build-essential libcurl4-openssl-dev

Step 1: Clone and Setup

git clone https://github.com/cactus-compute/cactus && cd cactus
source ./setup

Step 2: Usage Commands

Authentication:

cactus auth                      manage cloud API key
  --status                       show key status
  --clear                        remove saved key

Run Models (Playground):

cactus run <model>               open the playground (auto-downloads the model)
  --precision INT4|INT8|FP16     quantization (default: INT4)
  --token <token>                HF token (gated models)
  --reconvert                    force reconversion from source

Transcribe Audio:

cactus transcribe [model]        live mic transcription (default: parakeet-tdt-0.6b-v3)
  --file <audio.wav>             transcribe a file instead of the mic
  --precision INT4|INT8|FP16     quantization (default: INT4)
  --token <token>                HF token (gated models)
  --reconvert                    force reconversion from source

Download/Convert Models:

cactus download <model>          download a model to ./weights
  --precision INT4|INT8|FP16     quantization (default: INT4)
  --token <token>                HuggingFace API token
  --reconvert                    force reconversion from source
cactus convert <model> [dir]     convert a model, supports LoRA merge
  --precision INT4|INT8|FP16     quantization (default: INT4)
  --lora <path>                  LoRA adapter to merge
  --token <token>                HuggingFace API token

Build and Test:

cactus build                     build for ARM → build/libcactus.a
  --apple                        Apple (iOS/macOS)
  --android                      Android
  --flutter                      Flutter (all platforms)
  --python                       shared lib for Python FFI

cactus test                      run unit tests and benchmarks
  --model <model>                default: LFM2-VL-450M
  --transcribe_model <model>     default: moonshine-base
  --benchmark                    use larger models
  --precision INT4|INT8|FP16     regenerate weights with this precision
  --reconvert                    force reconversion from source
  --no-rebuild                   skip building the library
  --llm / --stt / --performance  run a specific test suite
  --ios                          run on a connected iPhone
  --android                      run on a connected Android device

Cleanup:

cactus clean                     remove all build artifacts
cactus --help                    show all commands and flags

AI-Generated Content

This content was automatically summarised, translated, and analysed by AI from the original HN AI Engineering post. Copyright belongs to the original author; please refer to the original for the authoritative version.
