Chonky: A Neural Library for Intelligent Semantic Text Chunking
Summary
Chonky is a Python library that uses a fine-tuned transformer model to automatically split text into semantically coherent chunks. This makes it a core component for improving the performance of RAG (Retrieval-Augmented Generation) systems. Users pass text to the `ParagraphSplitter` class, and the model identifies contextual semantic boundaries to return a high-quality sequence of chunks. For structured documents in formats such as Markdown, XML, and HTML, a bundled `MarkupRemover` helper extracts the plain text before splitting.
Key Points
- Chonky splits text into semantic chunks using a transformer model and is well suited to RAG systems.
- The `MarkupRemover` class handles multiple formats (Markdown, XML, HTML), making it easy to extract plain text from structured documents.
- The newest model, `mirth/chonky_mmbert_small_multilingual_1`, scores highly across languages (e.g., F1 of 0.91 for Spanish and 0.97 for Russian).
- In the benchmarks, Chonky's models achieve markedly higher F1 scores than SaT and the chonkie, LangChain, and LlamaIndex splitters on most datasets (SaT leads only on 20_newsgroups).
Chonky is a Python library that intelligently segments text into meaningful semantic chunks using a fine-tuned transformer model. It can be used as the chunking step in RAG systems.
pip install chonky
from chonky import ParagraphSplitter
# on the first run it will download the transformer model
splitter = ParagraphSplitter(device="cpu")
# Or you can select the model
# splitter = ParagraphSplitter(
#     model_id="mirth/chonky_modernbert_base_1",
#     device="cpu"
# )
text = (
    "Before college the two main things I worked on, outside of school, were writing and programming. "
    "I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. "
    "My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. "
    "The first programs I tried writing were on the IBM 1401 that our school district used for what was then called 'data processing.' "
    "This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, "
    "and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — "
    "CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."
)
for chunk in splitter(text):
    print(chunk)
    print("--")
Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.
The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it.
It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.
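Since each chunk is a plain string, the output drops straight into a typical RAG indexing step. A minimal sketch follows; sentence-transformers and the all-MiniLM-L6-v2 encoder are illustrative assumptions here, not Chonky dependencies:

from chonky import ParagraphSplitter
from sentence_transformers import SentenceTransformer

splitter = ParagraphSplitter(device="cpu")
# assumed embedding model for the sketch; any sentence encoder works
encoder = SentenceTransformer("all-MiniLM-L6-v2")

text = "Before college I worked on writing and programming. My first programs ran on the IBM 1401."
chunks = list(splitter(text))      # semantic chunks from Chonky
vectors = encoder.encode(chunks)   # one embedding per chunk, ready for a vector store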
For documents with markup, the usage pattern is as follows: strip all markup tags to produce plain text, then feed that text into the splitter. For this purpose there is a helper class, MarkupRemover, which automatically detects the content format:
from chonky.markup_remover import MarkupRemover
from chonky import ParagraphSplitter
remover = MarkupRemover()
splitter = ParagraphSplitter()
text = remover("# Header 1 ...")
splitter(text)
Supported formats: markdown, xml, html.
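Putting the two together end to end might look like this (the Markdown snippet is illustrative):

from chonky import ParagraphSplitter
from chonky.markup_remover import MarkupRemover

remover = MarkupRemover()
splitter = ParagraphSplitter(device="cpu")

# illustrative Markdown input
markdown_doc = (
    "# Early work\n"
    "\n"
    "Before college I worked on writing and programming. "
    "My first programs ran on the IBM 1401.\n"
)

plain_text = remover(markdown_doc)  # markup stripped; format auto-detected
for chunk in splitter(plain_text):
    print(chunk)
    print("--")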
Available models:
| Model ID | Seq Length (tokens) | Number of Params | Multilingual |
|---|---|---|---|
| mirth/chonky_modernbert_large_1 | 1024 | 396M | ❌ |
| mirth/chonky_modernbert_base_1 | 1024 | 150M | ❌ |
| mirth/chonky_mmbert_small_multilingual_1 🆕 | 1024 | 140M | ✅ |
| mirth/chonky_distilbert_base_uncased_1 | 512 | 66.4M | ❌ |
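Any of these can be passed as model_id, as in the commented-out example above; for non-English text, the multilingual checkpoint is the natural choice:

from chonky import ParagraphSplitter

# select the multilingual checkpoint from the table above
splitter = ParagraphSplitter(
    model_id="mirth/chonky_mmbert_small_multilingual_1",
    device="cpu",
)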
The following values are token-based F1 scores, computed on the first 1M tokens of each dataset (for performance reasons).
For the SaT models, do_ps denotes the do_paragraph_segmentation flag.
| Model | bookcorpus | en_judgements | paul_graham | 20_newsgroups |
|---|---|---|---|---|
| chonky_modernbert_large_1 | 0.79 ❗ | 0.29 ❗ | 0.69 ❗ | 0.17 |
| chonky_modernbert_base_1 | 0.72 | 0.08 | 0.63 | 0.15 |
| chonky_mmbert_small_multilingual_1 🆕 | 0.72 | 0.2 | 0.56 | 0.13 |
| chonky_distilbert_base_uncased_1 | 0.69 | 0.05 | 0.52 | 0.15 |
| SaT(sat-12l-sm, do_ps=False) | 0.33 | 0.03 | 0.43 | 0.31 |
| SaT(sat-12l-sm, do_ps=True) | 0.33 | 0.06 | 0.42 | 0.3 |
| SaT(sat-3l, do_ps=False) | 0.28 | 0.03 | 0.42 | 0.34 ❗ |
| SaT(sat-3l, do_ps=True) | 0.09 | 0.07 | 0.41 | 0.15 |
| chonkie SemanticChunker(bge-small-en-v1.5) | 0.21 | 0.01 | 0.12 | 0.06 |
| chonkie SemanticChunker(potion-base-8M) | 0.19 | 0.01 | 0.15 | 0.08 |
| chonkie RecursiveChunker | 0.07 | 0.01 | 0.05 | 0.02 |
| langchain SemanticChunker(all-mpnet-base-v2) | 0 | 0 | 0 | 0 |
| langchain SemanticChunker(bge-small-en-v1.5) | 0 | 0 | 0 | 0 |
| langchain SemanticChunker(potion-base-8M) | 0 | 0 | 0 | 0 |
| langchain RecursiveChar | 0 | 0 | 0 | 0 |
| llamaindex SemanticSplitter(bge-small-en-v1.5) | 0.06 | 0 | 0.06 | 0.02 |
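The evaluation harness itself isn't reproduced here, but token-based F1 for this task boils down to comparing predicted paragraph-break labels with gold labels token by token. A minimal sketch, assuming one binary is-break label per token:

def token_f1(pred, gold):
    """Token-based F1 over binary paragraph-break labels.

    pred, gold: sequences of 0/1, one label per token
    (1 = a paragraph break follows this token).
    """
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(g and not p for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. token_f1([0, 1, 0, 0, 1], [0, 1, 0, 1, 0]) == 0.5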
Per-language token-based F1 scores on the multilingual benchmark:
| Model | de | en | es | fr | it | nl | pl | pt | ru | sv | zh |
|---|---|---|---|---|---|---|---|---|---|---|---|
| chonky_mmbert_small_multilingual_1 🆕 | 0.88 ❗ | 0.78 ❗ | 0.91 ❗ | 0.93 ❗ | 0.86 ❗ | 0.81 ❗ | 0.81 ❗ | 0.88 ❗ | 0.97 ❗ | 0.91 ❗ | 0.11 |
| chonky_modernbert_large_1 | 0.53 | 0.43 | 0.48 | 0.51 | 0.56 | 0.21 | 0.65 | 0.53 | 0.87 | 0.51 | 0.33 ❗ |
| chonky_modernbert_base_1 | 0.42 | 0.38 | 0.34 | 0.4 | 0.33 | 0.22 | 0.41 | 0.35 | 0.27 | 0.31 | 0.26 |
| chonky_distilbert_base_uncased_1 | 0.19 | 0.3 | 0.17 | 0.2 | 0.18 | 0.04 | 0.27 | 0.21 | 0.22 | 0.19 | 0.15 |
| Number of val tokens | 1M | 1M | 1M | 1M | 1M | 1M | 38K | 1M | 24K | 1M | 132K |