Show HN: RunMat – runtime with auto CPU/GPU routing for dense math

<p align="center"> <img src=".github/assets/runmat-symbol.svg" alt="RunMat" height="80"> </p>
<h1 align="center">RunMat</h1>
<p align="center"> <strong>Open-source runtime for math. MATLAB syntax. CPU + GPU. No license fees.</strong> </p>
<p align="center"> RunMat automatically fuses operations and intelligently routes them between CPU and GPU.<br/> It runs on Windows, macOS, Linux, and WebAssembly, and supports NVIDIA, AMD, Apple Silicon, and Intel GPUs.<br/> Write standard MATLAB syntax and RunMat handles the rest. </p>
<p align="center"> <a href="https://github.com/runmat-org/runmat/actions"><img src="https://img.shields.io/github/actions/workflow/status/runmat-org/runmat/ci.yml?branch=main" alt="Build Status"></a> <a href="LICENSE.md"><img src="https://img.shields.io/badge/license-MIT-blue.svg" alt="License: MIT"></a> <a href="https://crates.io/crates/runmat"><img src="https://img.shields.io/crates/v/runmat.svg" alt="Crates.io"></a> <a href="https://crates.io/crates/runmat"><img src="https://img.shields.io/crates/d/runmat.svg" alt="Downloads"></a> <a href="https://github.com/runmat-org/runmat/stargazers"><img src="https://img.shields.io/github/stars/runmat-org/runmat" alt="GitHub Stars"></a> <a href="https://github.com/runmat-org/runmat/commits/main"><img src="https://img.shields.io/github/last-commit/runmat-org/runmat" alt="Last Commit"></a> </p>
<p align="center"> <a href="https://runmat.com/sandbox"><strong>Try it now, no install needed</strong></a> · <a href="https://runmat.com/docs">Docs</a> · <a href="https://runmat.com/blog">Blog</a> · <a href="docs/CHANGELOG.md">Changelog</a> · <a href="https://runmat.com">Website</a> </p>

curl -fsSL https://runmat.com/install.sh | sh    # Linux / macOS

iwr https://runmat.com/install.ps1 | iex         # Windows PowerShell

> [!NOTE]
> Pre-release (v0.4): the core runtime and GPU engine have passed thousands of tests. Expect some rough edges.

What is RunMat?

With RunMat you write your math in clean, readable MATLAB-style syntax. RunMat automatically fuses your operations into optimized kernels and runs them on the best available hardware, CPU or GPU. On GPU, it can match or beat hand-written CUDA on many dense numerical workloads.

It runs on whatever GPU you have (NVIDIA, AMD, Apple Silicon, Intel) through native APIs (Metal / DirectX 12 / Vulkan), with no vendor lock-in.

x  = 0:0.01:4*pi;
y0 = sin(x) .* exp(-x / 10);
y1 = y0 .* cos(x / 4) + 0.25 .* (y0 .^ 2);
...

The points in the chart below correspond to the number of elements in the x vector above:

Core ideas:

- MATLAB input-language compatibility, not a new language
- Fast on both CPU and GPU from a single runtime
- No device flags: the fusion stage automatically chooses CPU or GPU based on data size and transfer-cost heuristics
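
To make the fusion idea concrete, the following is a minimal Python sketch (not RunMat code, and not a description of how RunMat is implemented) that evaluates the two assignments from the snippet above in two ways: op by op, where every step produces a full temporary array, and as one fused pass that touches each element once. The function names `staged` and `fused` are hypothetical.

```python
import math

def staged(xs):
    """Evaluate op by op; every step materialises a full temporary list."""
    s = [math.sin(x) for x in xs]             # sin(x)
    e = [math.exp(-x / 10) for x in xs]       # exp(-x / 10)
    y0 = [a * b for a, b in zip(s, e)]        # y0 = sin(x) .* exp(-x / 10)
    c = [math.cos(x / 4) for x in xs]         # cos(x / 4)
    q = [0.25 * (v * v) for v in y0]          # 0.25 .* (y0 .^ 2)
    return [a * b + d for a, b, d in zip(y0, c, q)]

def fused(xs):
    """Single pass: the whole expression is computed per element, no temporaries."""
    out = []
    for x in xs:
        y0 = math.sin(x) * math.exp(-x / 10)
        out.append(y0 * math.cos(x / 4) + 0.25 * (y0 * y0))
    return out

# Equivalent of x = 0:0.01:4*pi
n = int(math.floor(4 * math.pi / 0.01)) + 1
xs = [i * 0.01 for i in range(n)]

y_staged = staged(xs)
y_fused = fused(xs)
```

Both functions return the same values; the staged version simply walks memory five extra times. That extra traffic is what a fusing runtime aims to avoid.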

How to use RunMat

The open-source runtime in this repo is the foundation for every RunMat interface:

<div align="center"> <table> <tr>
<td align="center" width="20%"> <h3>🌐 Browser</h3> No install required<br/><br/> Runs via WebAssembly + WebGPU.<br/> Your code never leaves your device.<br/><br/> <a href="https://runmat.com/sandbox"><strong>Try it now →</strong></a> </td>
<td align="center" width="20%"> <h3>⌨️ CLI</h3> Open source (this repo)<br/><br/> Run <code class="language-matlab">.m</code> files, benchmarks,<br/> and CI/CD integration.<br/><br/> <code class="language-bash">cargo install runmat</code> </td>
<td align="center" width="20%"> <h3>📦 NPM</h3> Embed anywhere<br/><br/> The full runtime, including execution, GPU, and plotting,<br/> available in any web app.<br/><br/> <a href="https://www.npmjs.com/package/runmat"><code class="language-bash">npm install runmat</code></a> </td>
<td align="center" width="20%"> <h3>🖥️ Desktop</h3> Coming soon<br/><br/> Native IDE with local files and<br/> full GPU acceleration. <br/><br/>   </td>
<td align="center" width="20%"> <h3>🌐 App</h3> Hobby tier available<br/><br/> Versioning, collaboration, team management. <br/><br/> </td>
</tr> </table> </div>

<a href="https://runmat.com/pricing"><strong>Pricing →</strong></a>


Feature highlights

MATLAB input-language compatibility
- Familiar .m files, arrays, and control flow
- Many MATLAB / Octave scripts run with few or no changes
Fusion: automatic CPU+GPU choice
- Builds an internal graph of array ops
- Fuses elementwise ops and reductions into larger kernels
- Chooses CPU or GPU per kernel based on shape and transfer cost
- Keeps arrays on the device when that is faster
Modern CPU runtime
- VM interpreter for fast startup
- Turbine JIT (Cranelift) for hot paths
- Generational GC tuned for numeric code
- Memory-safe by design (Rust)
Cross-platform GPU backend
- Uses wgpu / WebGPU
- Supports Metal (macOS), DirectX 12 (Windows), Vulkan (Linux), and WebGPU (browser)
- Falls back to CPU when a workload is too small to benefit from the GPU
Async-capable runtime
- Built on Rust futures for non-blocking evaluation
- GPU readback, interactive input, and long-running scripts never block the host
- Language-level async/await with cooperative tasks is on the roadmap
- RunMat scripts can run interactively in the browser without freezing the page (MATLAB has no equivalent)
WebAssembly target + NPM package
- The full runtime compiles to WASM and ships as part of this repo (runmat-wasm)
- Available on NPM as runmat, so any web app can embed execution, GPU acceleration, and plotting
- GPU acceleration works in the browser via WebGPU
- Powers the browser sandbox, where code runs locally rather than on a server
Plotting
- Interactive 2D and 3D plots with GPU-accelerated rendering
- 30+ plot types: line, scatter, bar, surface, mesh, histogram, stem, errorbar, area, contour, pie, plot3, imagesc, imshow, and log-scale variants
- Graphics handles, subplot state, annotation builtins (title, sgtitle, xlabel, legend), and 3D camera controls
Open-source plotting engine demo (works in the CLI and the browser sandbox):
<p align="center"> <a href=".github/assets/runmat-sandbox-3d-plotting.gif"><strong>Open the GIF directly</strong></a> · <a href="https://runmat.com/sandbox"><strong>Try it in the browser sandbox →</strong></a> </p>
Open-source runtime
- The full runtime, GPU engine, JIT, GC, and plotting (everything in this repo) are MIT licensed.
- Small binary, CLI-first design

Documentation

<details> <summary><strong>Related documentation</strong>: see <a href="docs/">docs/</a> for the full list</summary>

Getting started
Language & runtime
GPU acceleration
Plotting
- Plotting guide
Runtime architecture

Embedding & integration
- NPM package (runmat)
- Browser sandbox guide
Reference
- Function reference (400+ builtins)
Contributing
- Contributing guide
- Developer setup

</details>

Performance

Up to 131× faster than NumPy and 7× faster than PyTorch on Monte Carlo simulations. Hardware: Apple M2 Max, Metal. Median of 3 runs.

<details> <summary><strong>Monte Carlo raw data</strong></summary>

| Paths (simulations) | RunMat (ms) | PyTorch (ms) | NumPy (ms) | NumPy ÷ RunMat | PyTorch ÷ RunMat |
|---|---|---|---|---|---|
| 250k | 108.58 | 824.42 | 4,065.87 | 37.44× | 7.59× |
...

</details> <details> <summary><strong>4K Image Pipeline</strong>: up to 10× faster than NumPy</summary>

| B | RunMat (ms) | PyTorch (ms) | NumPy (ms) | NumPy ÷ RunMat | PyTorch ÷ RunMat |
|---|---|---|---|---|---|
| 4 | 142.97 | 801.29 | 500.34 | 3.50× | 5.60× |
...

</details> <details> <summary><strong>Elementwise Math</strong>: up to 144× faster than PyTorch at 1B elements</summary>

| Points | RunMat (ms) | PyTorch (ms) | NumPy (ms) | NumPy ÷ RunMat | PyTorch ÷ RunMat |
|---|---|---|---|---|---|
| 1M | 145.15 | 856.41 | 72.39 | 0.50× | 5.90× |
...

</details>
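
As a reading aid for the ratio columns, this short Python sketch recomputes them from the timing columns, using only the first row shown in each table above. A value above 1 means RunMat was faster; a value below 1 means the other library was faster. The helper name `speedups` is illustrative.

```python
# First row of each table above: (label, RunMat ms, PyTorch ms, NumPy ms)
rows = [
    ("Monte Carlo, 250k paths", 108.58, 824.42, 4065.87),
    ("4K image pipeline, B=4", 142.97, 801.29, 500.34),
    ("Elementwise math, 1M points", 145.15, 856.41, 72.39),
]

def speedups(runmat_ms, pytorch_ms, numpy_ms):
    """Return (NumPy / RunMat, PyTorch / RunMat) timing ratios."""
    return numpy_ms / runmat_ms, pytorch_ms / runmat_ms

for label, runmat_ms, pytorch_ms, numpy_ms in rows:
    vs_numpy, vs_pytorch = speedups(runmat_ms, pytorch_ms, numpy_ms)
    print(f"{label}: {vs_numpy:.2f}x vs NumPy, {vs_pytorch:.2f}x vs PyTorch")
```

Two things stand out. First, the 1M-point elementwise row has a NumPy ratio of about 0.50, so NumPy was roughly twice as fast as RunMat at that size. Second, ratios recomputed from the rounded millisecond values can differ in the last digit from the published column (37.45 here versus 37.44 in the table), presumably because the published ratios were derived from unrounded timings.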

For smaller arrays, fusion keeps the work on the CPU, so you still get low overhead and a fast JIT.

See benchmarks/ for reproducible test scripts, detailed results, and comparisons against NumPy, PyTorch, and Julia.
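
The reason small arrays stay on the CPU can be illustrated with a toy break-even model. The Python sketch below is an assumption-laden illustration, not RunMat's planner: the class `CostModel` and every constant in it are hypothetical placeholders, chosen only so that the crossover is visible.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CostModel:
    """All constants are hypothetical placeholders, not measured RunMat values."""
    cpu_elems_per_s: float = 2e8        # assumed CPU throughput for a fused kernel
    gpu_elems_per_s: float = 5e10       # assumed GPU throughput for the same kernel
    transfer_bytes_per_s: float = 1e10  # assumed host<->device bandwidth
    gpu_launch_s: float = 2e-4          # assumed fixed dispatch + sync overhead

    def cpu_time(self, n):
        return n / self.cpu_elems_per_s

    def gpu_time(self, n, bytes_to_move):
        return (self.gpu_launch_s
                + bytes_to_move / self.transfer_bytes_per_s
                + n / self.gpu_elems_per_s)

    def choose(self, n, resident_on_gpu=False, bytes_per_elem=8):
        # Data that already lives on the device pays no upload cost.
        bytes_to_move = 0 if resident_on_gpu else n * bytes_per_elem
        return "gpu" if self.gpu_time(n, bytes_to_move) < self.cpu_time(n) else "cpu"

model = CostModel()
small = model.choose(1_000)
large = model.choose(100_000_000)
borderline_cold = model.choose(45_000)
borderline_warm = model.choose(45_000, resident_on_gpu=True)
```

Under these made-up constants the fixed GPU overhead dominates for tiny inputs, the GPU wins for very large ones, and a borderline size flips from CPU to GPU once the data is already resident on the device.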

<details> <summary><strong>Quick Start: all install methods, CLI features, and Jupyter integration</strong></summary>

Installation

# Quick install (Linux/macOS)
curl -fsSL https://runmat.com/install.sh | sh

...

Linux prerequisite

To use BLAS/LAPACK acceleration on Linux, install your system's OpenBLAS package before building:

sudo apt-get update && sudo apt-get install -y libopenblas-dev

Run Your First Script

# Start the interactive REPL
runmat

...

CLI Features

# Check GPU acceleration status
runmat accel-info

...

See the CLI Documentation for the full command reference.

Jupyter Integration

# Register RunMat as a Jupyter kernel
runmat --install-kernel

...

</details> <details> <summary><strong>Architecture: CPU+GPU performance</strong></summary>

RunMat uses a tiered CPU runtime together with a fusion engine that automatically picks CPU or GPU for each chunk of math. Every component below is open source and lives in this repository.
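
For readers unfamiliar with tiered execution, here is a minimal Python sketch of the general idea: code starts in a cheap interpreter tier and is promoted to a compiled tier once it proves hot. This is a generic illustration, not RunMat's VM or JIT; `TieredRunner`, `HOT_THRESHOLD`, and the promotion rule are all assumptions.

```python
from collections import Counter

HOT_THRESHOLD = 3  # hypothetical; a real runtime would tune this

class TieredRunner:
    """Serves calls from an 'interpreter' tier until a function becomes hot."""

    def __init__(self, threshold=HOT_THRESHOLD):
        self.threshold = threshold
        self.calls = Counter()   # interpreter-tier call counts per function
        self.compiled = {}       # name -> optimised callable
        self.trace = []          # which tier served each call

    def run(self, name, interpret, compile_fn, *args):
        if name in self.compiled:
            self.trace.append("jit")
            return self.compiled[name](*args)
        self.calls[name] += 1
        if self.calls[name] >= self.threshold:
            # Hot path: pay the compile cost once, reuse it on later calls.
            self.compiled[name] = compile_fn()
        self.trace.append("interp")
        return interpret(*args)

runner = TieredRunner()
results = [
    runner.run("square", lambda v: v * v, lambda: (lambda v: v * v), i)
    for i in range(5)
]
```

The first calls run immediately with no compile delay, which is what gives fast startup; only code that keeps getting called pays for compilation.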

Key components

| Component | Purpose | Technology / Notes |
|---|---|---|
| runmat-vm | Baseline interpreter for instant startup | HIR-to-bytecode compiler, stack-based interpreter |
...

Why this matters

- **Tiered CPU execution** gives fast startup and strong single-machine performance.
- **Fusion engine** removes most manual device management and kernel tuning.
- **GPU backend** works on NVIDIA, AMD, Apple Silicon, and Intel through Metal / DirectX 12 / Vulkan, with no vendor lock-in.

</details> <details> <summary><strong>GPU Acceleration: Fusion & Auto-Offload</strong></summary>

RunMat accelerates MATLAB code on the GPU automatically, with no kernel code and no rewrites. The system works in four stages:

1. Capture the Math: RunMat builds an "acceleration graph" that captures the intent of your operations (shapes, operation categories, dependencies, and constants).

2. Decide What Should Run on GPU: The fusion engine detects long chains of elementwise operations and the reductions linked to them, and plans to execute them as combined GPU programs. The auto-offload planner estimates break-even points and routes work accordingly:

- Fusion detection: combines multiple operations into a single GPU dispatch.
- Auto-offload heuristics: considers element counts, reduction sizes, and matrix multiply saturation.
- Residency awareness: keeps tensors on the device when that is judged to be worthwhile.
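
The fusion-detection step can be sketched in a few lines of Python. This is a deliberately simplified illustration and not RunMat's algorithm: it treats the program as a linear sequence of ops tagged with a category, and it ignores shapes and real data dependencies. The names `Op` and `plan_fusion` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    name: str       # e.g. "sin", "mul", "sum"
    category: str   # "elementwise", "reduction", or "other"

def plan_fusion(ops):
    """Group each run of elementwise ops, plus one trailing reduction, into one dispatch."""
    groups, current = [], []
    for op in ops:
        if op.category == "elementwise":
            current.append(op.name)
        elif op.category == "reduction" and current:
            # A reduction that consumes the chain is folded into the same kernel.
            current.append(op.name)
            groups.append(current)
            current = []
        else:
            if current:
                groups.append(current)
                current = []
            groups.append([op.name])  # runs as its own dispatch
    if current:
        groups.append(current)
    return groups

ops = [
    Op("sin", "elementwise"), Op("exp", "elementwise"), Op("mul", "elementwise"),
    Op("sum", "reduction"),
    Op("matmul", "other"),
    Op("add", "elementwise"), Op("sqrt", "elementwise"),
]
plan = plan_fusion(ops)
```

In this toy input, seven operations collapse into three dispatches: one fused elementwise chain ending in a reduction, one standalone matrix multiply, and one trailing elementwise chain.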

3. Generate GPU Kernels: RunMat generates portable WGSL (WebGPU Shading Language) kernels that work across Metal (macOS), DirectX 12 (Windows), and Vulkan (Linux). Kernels are compiled once and then cached.

4. Execute Efficiently: The runtime minimizes host↔device transfers by uploading tensors once, running fused kernels directly on GPU memory, and gathering results only when needed.
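
The "upload once, gather only when needed" pattern from the last stage can be illustrated with a toy device that counts copies. This Python sketch is not a GPU API and not RunMat code; the `Device` class and its method names are hypothetical.

```python
class Device:
    """Toy device that counts host<->device copies; not a real GPU API."""

    def __init__(self):
        self.buffers = {}    # tensor name -> data held "on device"
        self.uploads = 0
        self.downloads = 0

    def ensure_resident(self, name, host_data):
        # Upload only the first time a tensor is needed on the device.
        if name not in self.buffers:
            self.buffers[name] = list(host_data)
            self.uploads += 1
        return self.buffers[name]

    def run_kernel(self, out_name, fn, *input_names):
        # Inputs and output stay in device memory; nothing crosses the bus.
        inputs = [self.buffers[n] for n in input_names]
        self.buffers[out_name] = [fn(*vals) for vals in zip(*inputs)]

    def gather(self, name):
        # Read back only when the host actually needs the values.
        self.downloads += 1
        return list(self.buffers[name])

dev = Device()
x = [1.0, 2.0, 3.0]
dev.ensure_resident("x", x)
dev.run_kernel("y", lambda v: v * v + 1.0, "x")     # fused: square, then add
dev.run_kernel("z", lambda a, b: a * b, "x", "y")   # reuses resident x and y
dev.ensure_resident("x", x)                         # already resident: no second upload
result = dev.gather("z")
```

Two kernels run, yet only one upload and one download occur, because the intermediate `y` never leaves the device.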

</details>