
HN Key Summaries · 2026-04-24 12:22

An Essential Guide for the LLM Era: The Data Engineering Book

Summary

This book is an open-source, community-driven guide focused on data, the key factor that determines large language model (LLM) performance. Going beyond theory, it systematically covers the entire LLM data lifecycle, from pre-training data cleaning through multimodal alignment to building RAG pipelines. It is organized into 6 parts and 13 chapters, plus 5 end-to-end capstone projects with working code (e.g., building a Mini-C4 pre-training set, legal-domain SFT), so practitioners can gain depth they can apply immediately.

Key Points

  • Data quality sets the upper bound on LLM performance, so understanding a systematic data engineering process (Data Ops → AI Ops) is essential.
  • The guide covers the full LLM data lifecycle: pre-training data, multimodal processing, alignment data construction (SFT/RLHF), and RAG pipelines.
  • For hands-on learning, it provides 5 end-to-end projects with code, including building a Mini-C4 set, legal-domain SFT, and a LLaVA multimodal instruction set.
  • It covers distributed computing technologies such as Ray Data, Spark, and Dask, along with a modern stack including Parquet and vector databases (Milvus/Qdrant).

Show HN: Data Engineering Book – An open source, community-driven guide

"Data is the new oil, but only if you know how to refine it."

In the era of large models, data quality determines the upper bound of model performance. Yet systematic resources on LLM data engineering remain extremely scarce — most teams are still learning by trial and error.

This book is designed to fill that gap. We systematically cover the complete technical stack from pre-training data cleaning to multimodal alignment, from RAG retrieval augmentation to synthetic data generation, including:

  • 🧹 Pre-training Data Engineering: Extracting high-quality corpora from massive noisy data sources like Common Crawl
  • 🖼️ Multimodal Data Processing: Collection, cleaning, and alignment of image-text pairs, video, and audio data
  • 🎯 Alignment Data Construction: Automated generation of SFT instruction data, RLHF preference data, and CoT reasoning data
  • 🔍 RAG Data Pipeline: Enterprise-grade document parsing, semantic chunking, and multimodal retrieval
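A core step in the cleaning stage above is near-duplicate removal, for which the book's stack uses MinHash LSH. The idea can be sketched in pure Python (function names, the shingle size, and the 64-permutation setting are illustrative choices, not the book's code):

```python
import hashlib
import random

def shingles(text, k=5):
    """Character k-shingles of a whitespace-normalized, lowercased text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(text, num_perm=64, seed=0):
    """For each of num_perm salted hash functions, keep the minimum
    hash value seen over the text's shingle set."""
    salts = random.Random(seed).sample(range(2**32), num_perm)
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(f"{salt}:{s}".encode(), digest_size=8).digest(), "big"
            )
            for s in shingles(text)
        )
        for salt in salts
    ]

def est_jaccard(sig_a, sig_b):
    """The fraction of agreeing signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature("The quick brown fox jumps over the lazy dog.")
b = minhash_signature("The quick brown fox jumps over the lazy dog!")
c = minhash_signature("An entirely different sentence about tax law.")
# near-duplicates agree on far more slots than unrelated texts
```

At Common Crawl scale, signatures are bucketed with locality-sensitive hashing so only likely duplicates are compared pairwise, rather than all document pairs.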

Beyond in-depth theoretical explanations, the book includes 5 end-to-end capstone projects with runnable code and detailed architecture designs for hands-on learning.

Read Online

A complete data engineering pipeline from raw data to end-to-end applications

📖 6 Parts, 13 Chapters + 5 Capstone Projects

  • Part 1: Infrastructure & Core Concepts

    • Chapter 1: Data Revolution in the LLM Era (From Data Ops to AI Ops)
    • Chapter 2: AI-Native Data Stack
  • Part 2: Large-Scale Text Pre-training Engineering

    • Chapter 3: Data Acquisition
    • Chapter 4: Cleaning & Quality Control
    • Chapter 5: Tokenization, Serialization & Efficient Loading
  • Part 3: Multimodal Data Engineering

    • Chapter 6: Image-Text Pair Processing
    • Chapter 7: Recaptioning
    • Chapter 8: Video & Audio Data
  • Part 4: Alignment & Synthetic Data Engineering

    • Chapter 9: Instruction Fine-tuning Data
    • Chapter 10: Synthetic Data
    • Chapter 11: Human Preference Data
  • Part 5: Application-level Data Engineering

    • Chapter 12: RAG Data Pipeline
    • Chapter 13: Multimodal RAG
  • Part 6: Capstone Projects

    • Project 1: Building Mini-C4 Pre-training Set
    • Project 2: Domain Expert SFT (Legal)
    • Project 3: Building LLaVA Multimodal Instruction Set
    • Project 4: Synthetic Math/Code Textbook
    • Project 5: Multimodal RAG Financial Report Assistant
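Project 2 and Chapter 10 lean on Self-Instruct-style generation: seed instructions are fed to a model as in-context examples, and novel candidates are kept after a similarity filter. A minimal sketch of that generate-then-filter loop follows; the stubbed model call, helper names, and the 0.7 overlap threshold are assumptions for illustration, not the book's implementation:

```python
import random

SEED_TASKS = [
    "Summarize the key holding of the following court opinion.",
    "List the statutory elements of breach of contract.",
    "Explain the difference between civil and criminal liability.",
]

def word_overlap(a, b):
    """Jaccard overlap of word sets -- a crude stand-in for ROUGE filtering."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def generate_candidates(seeds, model, n=5, rng=None):
    """Self-Instruct loop (sketch): sample pool instructions as in-context
    examples, ask the model for a new instruction, keep only novel ones."""
    rng = rng or random.Random(0)
    pool = list(seeds)
    for _ in range(n):
        examples = rng.sample(pool, k=min(2, len(pool)))
        candidate = model("\n".join(examples))
        # drop candidates too similar to anything already in the pool
        if all(word_overlap(candidate, p) < 0.7 for p in pool):
            pool.append(candidate)
    return pool[len(seeds):]

# Stub standing in for a real LLM call -- swap in your API of choice.
fake_model = lambda prompt: "Draft a clause limiting liability in a service agreement."
new_instructions = generate_candidates(SEED_TASKS, fake_model)
```

In a real pipeline the generated instructions would then be paired with model-written responses and passed through quality and CoT-consistency checks before entering the SFT set.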

  • Data-Centric AI philosophy throughout
  • Covers the full LLM data lifecycle: Pre-training → Fine-tuning → RLHF → RAG
  • In-depth coverage of Scaling Laws, data quality evaluation, multimodal alignment, and more
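The Scaling Laws discussion above is easy to make concrete: the Chinchilla-style parametric fit models loss as a function of parameter count N and token count D. The coefficients below are the published Chinchilla fit (Hoffmann et al., 2022); treat the exact numbers as illustrative:

```python
# Chinchilla parametric loss: L(N, D) = E + A / N**alpha + B / D**beta
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params, n_tokens):
    """Estimated pre-training loss for n_params parameters, n_tokens tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# At a fixed model size, more (quality) data lowers the estimated loss:
l_small = loss(7e9, 140e9)   # 7B params, 140B tokens (~20 tokens/param)
l_more  = loss(7e9, 1.4e12)  # same model, 10x the tokens
```

The asymmetry between the N and D terms is the quantitative argument for data-centric work: past the compute-optimal point, extra tokens (and cleaner tokens) move the loss more cheaply than extra parameters.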
| Domain | Technologies |
| --- | --- |
| Distributed Computing | Ray Data, Spark, Dask |
| Data Storage | Parquet, WebDataset, Vector Databases (Milvus/Qdrant) |
| Text Processing | Trafilatura, KenLM, MinHash LSH, fastText Quality Scoring |
| Multimodal | CLIP, ColPali, img2dataset |
| Data Versioning | DVC, LakeFS, Pachyderm |
| Project | Core Technologies | Output |
| --- | --- | --- |
| Mini-C4 Pre-training Set | Trafilatura + Ray + MinHash | High-quality text corpus |
| Legal Expert SFT | Self-Instruct + CoT | Domain instruction dataset |
| LLaVA Multimodal | Bbox alignment + multi-image interleaving | Visual instruction dataset |
| Math Textbook | Evol-Instruct + sandbox verification | PoT reasoning dataset |
| Financial Report RAG | ColPali + Qwen-VL | Multimodal QA system |
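A building block the RAG-oriented chapters depend on is semantic chunking: splitting documents on sentence boundaries and packing sentences into overlapping, size-bounded chunks so retrieval context is not cut mid-thought. A minimal sketch under those assumptions (the function, word budget, and overlap policy are illustrative, not the book's code):

```python
import re

def chunk(text, max_words=100, overlap_sents=1):
    """Greedy sentence packing: split on sentence-ending punctuation, pack
    sentences up to max_words per chunk, and repeat the last overlap_sents
    sentences at the start of the next chunk for retrieval continuity."""
    sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, cur = [], []
    for s in sents:
        if cur and sum(len(x.split()) for x in cur) + len(s.split()) > max_words:
            chunks.append(" ".join(cur))
            cur = cur[-overlap_sents:]
        cur.append(s)
    if cur:
        chunks.append(" ".join(cur))
    return chunks

paragraphs = chunk(
    "One two three. Four five six. Seven eight nine. Ten eleven twelve.",
    max_words=6,
)
# -> three overlapping chunks, each repeating the previous chunk's last sentence
```

Production pipelines typically replace the regex splitter with layout-aware document parsing and pick chunk boundaries by embedding similarity rather than a fixed word budget, but the packing-with-overlap structure is the same.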

Setup & Usage

  • Python 3.8+
  • MkDocs Material
  • mkdocs-static-i18n (i18n support)
```bash
# Clone the repository
git clone https://github.com/datascale-ai/data_engineering_book.git
cd data_engineering_book

# Install dependencies
pip install mkdocs-material mkdocs-glightbox pymdown-extensions "mkdocs-static-i18n[material]"

# Local preview
mkdocs serve
```

Visit http://127.0.0.1:8000 to preview the book (with a Chinese/English/Japanese language switcher). To build the static site:

```bash
mkdocs build
```

The generated static files are located in the site/ directory.
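The language switcher is driven from mkdocs.yml via the mkdocs-static-i18n plugin. For a three-language, folder-per-locale layout like this one, the plugin configuration typically looks roughly like the fragment below (illustrative; not the repo's actual file):

```yaml
theme:
  name: material
plugins:
  - i18n:
      docs_structure: folder      # matches the docs/zh, docs/en, docs/ja layout
      languages:
        - locale: zh
          name: 简体中文
          default: true
        - locale: en
          name: English
        - locale: ja
          name: 日本語
```

With `docs_structure: folder`, each locale directory is built as a parallel copy of the site and Material's header exposes the switcher automatically.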

Directory Structure:

```
data_engineering_book/
├── docs/
│   ├── zh/                  # Chinese content
│   │   ├── index.md         # Chinese homepage
│   │   └── part1/ ~ part6/  # All chapters
│   ├── en/                  # English content
│   │   ├── index.md         # English homepage
│   │   └── part1/ ~ part6/  # All chapters
│   ├── ja/                  # Japanese content
│   │   ├── index.md         # Japanese homepage
│   │   └── part1/ ~ part6/  # All chapters
│   ├── images/              # Image assets (shared)
│   ├── stylesheets/         # Custom styles
│   └── javascripts/         # JavaScript (MathJax etc.)
├── .github/workflows/       # GitHub Actions CI/CD
├── images/                  # Project image assets
│   ├── structure_cn.png     # Book architecture diagram (Chinese)
│   └── structure_en.png     # Book architecture diagram (English)
├── mkdocs.yml               # MkDocs configuration
├── LICENSE                  # License
├── README.md                # Chinese README
└── README_en.md             # English README (this file)
```

Target Audience:

  • LLM R&D Engineers
  • Data Engineers / MLOps Engineers
  • AI Product Managers (Technical)
  • Researchers interested in LLM data pipelines

Professor Jun Yu's Team
Laboratory Information:
National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China;
Multimedia Computing and Intelligent Robotics Research Center, Department of Automation, University of Science and Technology of China;
Joint Research Center for Multi-Modal Intelligent Agents, Department of Automation, University of Science and Technology of China

Contributions are welcome! Feel free to submit Issues and Pull Requests.

  • Fork this repository
  • Create a feature branch (git checkout -b feature/AmazingFeature)
  • Commit your changes (git commit -m 'Add some AmazingFeature')
  • Push to the branch (git push origin feature/AmazingFeature)
  • Open a Pull Request

This project is licensed under the MIT License - see the LICENSE file for details.

AI-Generated Content

This content was automatically summarized, translated, and analyzed by AI from the original post on HN AI Engineering. Copyright remains with the original author; please refer to the original post for accurate details.
