Advanced Chunking Library Chonkie Released: An Open-Source Solution for Transforming RAG Performance
Summary
Chonkie is a lightweight open-source library for data chunking and embedding, a core step in retrieval-augmented generation (RAG) applications. It was developed to overcome the limitations of existing libraries, which tend to be either overly complex or underperforming. Beyond token-based, sentence-based, and recursive approaches, Chonkie supports eight advanced chunking strategies that reflect recent research, including 'Semantic Double Pass Chunking' and code-aware 'Code Chunking'. Notably, its install footprint is far smaller than alternatives (about 15MB for the default install), and its token chunking is up to 33x faster than LangChain/LlamaIndex.
Key Points
- Chonkie is a lightweight open-source library for data chunking and embedding, the core of RAG pipelines.
- Beyond token- and sentence-based methods, it supports eight advanced strategies reflecting recent research, including 'Semantic Double Pass Chunking' and 'Code Chunking'.
- Its install footprint is markedly smaller than existing alternatives (about 15MB), and its token chunking is up to 33x faster than LangChain/LlamaIndex.
- It provides 'handshake' integrations with major vector databases such as pgVector, Chroma, and Qdrant, simplifying the embedding and retrieval process.
Launch HN: Chonkie (YC X25) – Open-Source Library for Advanced Chunking
Hey HN! We're Shreyash and Bhavnick. We're building Chonkie (https://chonkie.ai), an open-source library for chunking and embedding data.
Python: https://github.com/chonkie-inc/chonkie
TypeScript: https://github.com/chonkie-inc/chonkie-ts
Here's a video showing our code chunker: https://youtu.be/Xclkh6bU1P0.
Bhavnick and I have been building personal projects with LLMs for a few years. For much of this time, we found ourselves writing our own chunking logic to support RAG applications. We often hesitated to use existing libraries because they either had only basic features or felt too bloated (some are 80MB+).
We built Chonkie to be lightweight, fast, extensible, and easy. The space is evolving rapidly, and we wanted Chonkie to be able to quickly support the newest strategies. We currently support the following (a short usage sketch follows the list):
- Token Chunking,
- Sentence Chunking,
- Recursive Chunking,
- Semantic Chunking,
- Semantic Double Pass Chunking: Chunks text semantically first, then merges closely related chunks.
- Code Chunking: Chunks code files by creating an AST and finding ideal split points.
- Late Chunking: Based on the paper (https://arxiv.org/abs/2409.04701), where chunk embeddings are derived from embedding a longer document.
- Slumber Chunking: Based on the "Lumber Chunking" paper (https://arxiv.org/abs/2406.17526). It uses recursive chunking, then an LLM verifies split points, aiming for high-quality chunks with reduced token usage and LLM costs.
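To give a sense of the API, here is a minimal usage sketch based on the pattern in the Python README; the exact constructor arguments are assumptions and may differ across versions:

```python
from chonkie import TokenChunker

# Initialize a token chunker; the tokenizer is resolved from a name string.
# The chunk_size/chunk_overlap values here are illustrative, not defaults.
chunker = TokenChunker(tokenizer="gpt2", chunk_size=512, chunk_overlap=64)

# Chunkers are callable and return chunk objects with text and metadata.
chunks = chunker("Chonkie chunks text so your RAG pipeline doesn't have to.")
for chunk in chunks:
    print(chunk.token_count, chunk.text)
```

The other chunkers follow the same pattern: construct once with strategy-specific parameters, then call on your text.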
You can see how Chonkie compares to LangChain and LlamaIndex in our benchmarks: https://github.com/chonkie-inc/chonkie/blob/main/BENCHMARKS....
Some technical details about the Chonkie package:
- ~15MB default install vs. ~80-170MB for some alternatives.
- Up to 33x faster token chunking compared to LangChain and LlamaIndex in our tests.
- Works with major tokenizers (transformers, tokenizers, tiktoken).
- Zero external dependencies for basic functionality.
- Implements aggressive caching and precomputation.
- Uses running mean pooling for efficient semantic chunking (see the sketch after this list).
- Modular dependency system (install only what you need).
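To illustrate the running mean pooling point above, here is a minimal NumPy sketch of threshold-based semantic grouping with an incrementally updated chunk mean; it is an illustration of the idea, not Chonkie's internal code:

```python
import numpy as np

def semantic_chunks(sentences, embed, threshold=0.7):
    """Group sentences into chunks via a running mean of their embeddings.

    `embed` maps a sentence to a 1-D vector; `threshold` is the minimum
    cosine similarity to the current chunk's mean. Illustrative only.
    """
    chunks, current, mean = [], [], None
    for sent in sentences:
        v = embed(sent)
        if mean is None:
            current, mean = [sent], v.astype(float)
            continue
        sim = float(mean @ v) / (np.linalg.norm(mean) * np.linalg.norm(v))
        if sim >= threshold:
            # Incremental mean update: O(d) per sentence, no re-averaging.
            n = len(current)
            mean = (mean * n + v) / (n + 1)
            current.append(sent)
        else:
            chunks.append(" ".join(current))
            current, mean = [sent], v.astype(float)
    if current:
        chunks.append(" ".join(current))
    return chunks
```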
In addition to chunking, Chonkie also provides an easy way to create embeddings. For supported providers (SentenceTransformer, Model2Vec, OpenAI), you just specify the model name as a string. You can also create custom embedding handlers for other providers.
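In code, that string-based resolution looks roughly like this; the `embedding_model` parameter name and the model string follow the README pattern and may vary by version:

```python
from chonkie import SemanticChunker

# The embedding backend is picked from a plain model-name string
# (SentenceTransformer/Model2Vec/OpenAI models are supported per the post).
chunker = SemanticChunker(embedding_model="minishlab/potion-base-8M")
chunks = chunker("A long document to be split along semantic boundaries ...")
```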
RAG is still the most common use case today. However, Chonkie produces chunks optimized for high-quality embeddings and vector retrieval, so it isn't really tied to the "generation" part of RAG. In fact, we're seeing more and more people use Chonkie to implement semantic search and/or to set context for agents.
We are currently focused on building integrations to simplify the retrieval process. We've created "handshakes" – thin functions that interact with vector DBs like pgVector, Chroma, TurboPuffer, and Qdrant, so you can write chunks to storage easily. If there's an integration you'd like to see (vector DB or otherwise), please let us know.
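As a sketch of the handshake pattern, with class and parameter names assumed from the description above (check the repo for the exact API):

```python
from chonkie import RecursiveChunker, ChromaHandshake

document_text = "Some long document to index ..."

chunker = RecursiveChunker()
chunks = chunker(document_text)

# The handshake embeds the chunks and writes them into the vector store,
# so there is no separate upsert step to manage.
handshake = ChromaHandshake(collection_name="docs")  # hypothetical parameter
handshake.write(chunks)
```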
We also offer hosted and on-premise versions with OCR, extra metadata, all embedding providers, and managed vector databases for teams that want a fully managed pipeline. If you're interested, reach out at shreyash@chonkie.ai or book a demo: https://cal.com/shreyashn/chonkie-demo.
We're eager to hear your feedback and comments! Thanks!