The Need for a Functional DSL for Data-Embedding-API Pipelines
Summary
With RAG (Retrieval-Augmented Generation) workloads on the rise, reading structured data (JSONL), converting it into embedding vectors, and sending it in batches to an external API or vector DB has become an essential process. Today, however, it has to be handled with messy, inefficient imperative Python code: nested for-loops, manual JSON parsing, and so on. This post raises the question of why there is no dedicated framework or DSL that can express this process declaratively and composably, in the style of Unix pipes or functional languages like Lisp and Elixir.
Key Points
- The current data-embedding pipeline requires inefficient imperative Python code for a chain of steps: reading JSONL files, key-based matching, conversion into (text, embedding) pairs, batching, and pushing to external systems.
- The author argues that this process should be expressed as a declarative, composable DSL along the lines of `Source | Match | Transform | Filter | Batch | Sink`, in the spirit of Unix pipes (`|`) or the functional syntax of Lisp/Elixir.
- Even though this pattern has become commonplace as RAG (Retrieval-Augmented Generation) goes mainstream, general-purpose libraries such as Pandas and Dask do not solve the specific "structured data -> transform -> push to API" workload in a cohesive way.
- The ideal solution would be an intuitive, pipeline-oriented syntax such as `cat input.jsonl | match output.jsonl on custom_id | extract (text, embedding) | filter not-empty | batch 50 | send-to-chroma`.
Ask HN: Why don't we have a functional DSL for data+embedding+API pipelines?
I’ve been working on a pretty common problem:
- I have structured data in JSONL files (in.jsonl, out.jsonl)
- I match lines by a key
- I transform them into (text, embedding) pairs
- I optionally filter/map them
- I batch them (into chunks of 50)
- I push each batch into an external system (e.g. vector DB, Chroma)
That’s it. Sounds trivial. But it turns into ugly imperative Python code very quickly: nested for-loops, global indices, +=, manual batching, line-by-line handling, low-level JSON parsing.
Here’s what it usually looks like in Python:
with open("in.json", "r") as fin:
with open("out.json", "r") as fout:
for in_line, out_line in zip(fin, fout):
in_data = json.loads(in_line)
out_data = json.loads(out_line)
if in_data["custom_id"] != out_data["custom_id"]:
raise Exception...
texts = in_data["body"]["input"]
embeddings = [d["embedding"] for d in out_data["response"]["body"]["data"]]
for i in range(len(texts)):
doc = texts[i]
emb = embeddings[i]
metadata = {
"source": f"chunk-{global_ids}",
We’re in 2025, and this is how we’re wiring data into APIs.
Why do we tolerate this?
This is a declarative, streaming, data processing problem. Why aren’t we using something more elegant? Something more composable, like functional pipelines?
I'm asking myself: Why don’t we have a composable, streaming, functional DSL for this kind of task?
Why not build it like Unix pipes?
What I want is something that feels like:
cat input.jsonl \
| match output.jsonl on custom_id \
| extract (text, embedding) \
| filter not-empty \
| batch 50 \
| send-to-chroma
In Lisp / Clojure:
(->> (map vector input output)        ; Clojure has no zip; pair with map vector
     (filter (fn [[in out]] (= (:custom_id in) (:custom_id out))))
     (mapcat (fn [[in out]] (map vector (:input in) (:embedding out))))
     (partition-all 50)
     (run! send-to-chroma))           ; run! forces the lazy seq for side effects
In Elixir + Broadway:
Broadway
|> read_stream("in.jsonl", "out.jsonl")
|> match_on(:custom_id)
|> map(&{&1.text, &1.embedding})
|> batch_every(50)
|> send_to_chroma()
And now, back to Python...
We’re stuck writing imperative soup or building hacky DSLs with things like:
load_json_pairs()
| where(is_valid)
| select(to_embedding_record)
| batch(50)
| foreach(send_to_chroma)
...or, more realistically, writing thousands of lines of with open(...) as f.
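For what it's worth, the pipe style above is buildable in a page of plain Python by overloading | via __ror__. A minimal sketch — the stage names mirror the pseudo-DSL above and are hypothetical, not from any existing library:

from itertools import islice

class Stage:
    """Wraps a generator transform so that `iterable | stage` chains lazily."""
    def __init__(self, fn):
        self.fn = fn
    def __ror__(self, upstream):  # called for: upstream | stage
        return self.fn(upstream)

def where(pred):
    return Stage(lambda items: (x for x in items if pred(x)))

def select(f):
    return Stage(lambda items: (f(x) for x in items))

def batch(n):
    def chunks(items):
        it = iter(items)
        while piece := list(islice(it, n)):
            yield piece
    return Stage(chunks)

def foreach(action):
    def drain(items):
        for x in items:
            action(x)
    return Stage(drain)

# load_json_pairs() | where(is_valid) | select(to_embedding_record) \
#     | batch(50) | foreach(send_to_chroma)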
And even though libraries like tf.data.Dataset, dask.bag, pandas, or pipe exist, none of them really solve this use case in a cohesive and expressive way. They all focus on either tabular data, or big data, or ML input pipelines – not this "structured data -> transform -> push to API" pattern.
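To be fair, the stdlib gets you a composable version with plain generators — but every project hand-rolls the same four helpers, which is exactly the gap. A sketch, using the field layout from the imperative example above (send_to_chroma as sketched further down):

import json
from itertools import islice

def read_pairs(in_path, out_path):
    with open(in_path) as fin, open(out_path) as fout:
        for in_line, out_line in zip(fin, fout):
            yield json.loads(in_line), json.loads(out_line)

def matched(pairs, key="custom_id"):
    # (the imperative version raised on mismatch; silently filtering is a design choice)
    return (p for p in pairs if p[0][key] == p[1][key])

def to_records(pairs):
    i = 0
    for in_data, out_data in pairs:
        embs = [d["embedding"] for d in out_data["response"]["body"]["data"]]
        for doc, emb in zip(in_data["body"]["input"], embs):
            yield doc, emb, {"source": f"chunk-{i}"}
            i += 1

def batched(items, n):
    it = iter(items)
    while piece := list(islice(it, n)):
        yield piece

# for chunk in batched(to_records(matched(read_pairs("in.jsonl", "out.jsonl"))), 50):
#     send_to_chroma(chunk)   # sink sketched below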
This is especially absurd now that everyone’s doing RAG
With Retrieval-Augmented Generation (RAG) becoming the norm, we’re all parsing files, extracting embeddings, enriching metadata, batching, and inserting into vector stores.
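The sink step alone is the same handful of lines in every one of those codebases. For concreteness, a sketch against the real chromadb client — the collection name and id scheme are arbitrary choices here:

import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("rag_chunks")

def send_to_chroma(batch):
    # batch is a list of (doc, embedding, metadata) tuples, as built above
    docs, embs, metas = zip(*batch)
    collection.add(
        ids=[m["source"] for m in metas],  # e.g. "chunk-0", "chunk-1", ...
        documents=list(docs),
        embeddings=list(embs),
        metadatas=list(metas),
    )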
Why are we all writing the same low-level, ad-hoc code to do this?
Shouldn’t this entire category of work be addressed by proper DSL/framework?
Wouldn’t it make sense to build...
- a functional DSL for JSON-to-embedding-to-API pipelines?
- or a Python library with proper map, filter, batch, pipe, sink semantics?
- or even a streaming runtime like Elixir Broadway or a minimal functional Rx-style graph?
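On that last bullet: RxPY (the reactivex package) can already express the shape of this graph today. A sketch — load_json_pairs, is_valid, to_embedding_record, and send_to_chroma are the hypothetical helpers from the pseudo-DSL above:

import reactivex as rx
from reactivex import operators as ops

rx.from_iterable(load_json_pairs()).pipe(
    ops.filter(is_valid),
    ops.map(to_embedding_record),
    ops.buffer_with_count(50),   # batch into chunks of 50
).subscribe(on_next=send_to_chroma)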
Even R with dplyr has more elegant ways to express transformations than what we do in Python for these jobs.
Am I missing something?
Is there a tool, a language, or a framework out there that actually solves this well?
Or is this just one of those gaps in the tooling ecosystem that no one has filled yet?
Would love to hear what others are doing – and if anyone’s already working on a solution like this.
Thanks.