
HN Key Summaries · 2026-04-24 12:22

An Essential Guide for the LLM Era: The Data Engineering Book

Summary

This book is an open-source, community-driven guide focused on data, the key factor that determines large language model (LLM) performance. Going beyond theory, it systematically covers the entire LLM data lifecycle, from pre-training data cleaning through multimodal alignment to building RAG pipelines. It is organized into 6 parts and 13 chapters, plus 5 end-to-end capstone projects with working code (e.g., building a Mini-C4 pre-training set, legal-domain SFT), so practitioners can gain depth they can apply immediately.

Key Points

  • Data quality sets the upper bound on LLM performance, so understanding a systematic data engineering process (Data Ops → AI Ops) is essential.
  • The guide covers the full LLM data lifecycle: pre-training data, multimodal processing, alignment data construction (SFT/RLHF), and RAG pipelines.
  • For hands-on learning, it provides 5 end-to-end projects with code, including building a Mini-C4 set, legal-domain SFT, and a LLaVA multimodal instruction set.
  • It covers distributed computing technologies such as Ray Data, Spark, and Dask, along with a modern stack including Parquet and vector databases (Milvus/Qdrant).

Show HN: Data Engineering Book – An open source, community-driven guide

"Data is the new oil, but only if you know how to refine it."

In the era of large models, data quality determines the upper bound of model performance. Yet systematic resources on LLM data engineering remain extremely scarce — most teams are still learning by trial and error.

This book is designed to fill that gap. We systematically cover the complete technical stack from pre-training data cleaning to multimodal alignment, from RAG retrieval augmentation to synthetic data generation, including:

  • 🧹 Pre-training Data Engineering: Extracting high-quality corpora from massive noisy data sources like Common Crawl
  • 🖼️ Multimodal Data Processing: Collection, cleaning, and alignment of image-text pairs, video, and audio data
  • 🎯 Alignment Data Construction: Automated generation of SFT instruction data, RLHF preference data, and CoT reasoning data
  • 🔍 RAG Data Pipeline: Enterprise-grade document parsing, semantic chunking, and multimodal retrieval
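A core step in the cleaning stage above is near-duplicate removal, for which the book's stack uses MinHash LSH. The idea can be sketched in pure Python (function names, the shingle size, and the 64-permutation setting are illustrative choices, not the book's code):

```python
import hashlib
import random

def shingles(text, k=5):
    """Character k-shingles of a whitespace-normalized, lowercased text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(text, num_perm=64, seed=0):
    """For each of num_perm salted hash functions, keep the minimum
    hash value seen over the text's shingle set."""
    salts = random.Random(seed).sample(range(2**32), num_perm)
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(f"{salt}:{s}".encode(), digest_size=8).digest(), "big"
            )
            for s in shingles(text)
        )
        for salt in salts
    ]

def est_jaccard(sig_a, sig_b):
    """The fraction of agreeing signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature("The quick brown fox jumps over the lazy dog.")
b = minhash_signature("The quick brown fox jumps over the lazy dog!")
c = minhash_signature("An entirely different sentence about tax law.")
# near-duplicates agree on far more slots than unrelated texts
```

At Common Crawl scale, signatures are bucketed with locality-sensitive hashing so only likely duplicates are compared pairwise, rather than all document pairs.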

Beyond in-depth theoretical explanations, the book includes 5 end-to-end capstone projects with runnable code and detailed architecture designs for hands-on learning.

Read Online

A complete data engineering pipeline from raw data to end-to-end applications

📖 6 Parts, 13 Chapters + 5 Capstone Projects

  • Part 1: Infrastructure & Core Concepts

    • Chapter 1: Data Revolution in the LLM Era (From Data Ops to AI Ops)
    • Chapter 2: AI-Native Data Stack
  • Part 2: Large-Scale Text Pre-training Engineering

    • Chapter 3: Data Acquisition
    • Chapter 4: Cleaning & Quality Control
    • Chapter 5: Tokenization, Serialization & Efficient Loading
  • Part 3: Multimodal Data Engineering

    • Chapter 6: Image-Text Pair Processing
    • Chapter 7: Recaptioning
    • Chapter 8: Video & Audio Data
  • Part 4: Alignment & Synthetic Data Engineering

    • Chapter 9: Instruction Fine-tuning Data
    • Chapter 10: Synthetic Data
    • Chapter 11: Human Preference Data
  • Part 5: Application-level Data Engineering

    • Chapter 12: RAG Data Pipeline
    • Chapter 13: Multimodal RAG
  • Part 6: Capstone Projects

    • Project 1: Building Mini-C4 Pre-training Set
    • Project 2: Domain Expert SFT (Legal)
    • Project 3: Building LLaVA Multimodal Instruction Set
    • Project 4: Synthetic Math/Code Textbook
    • Project 5: Multimodal RAG Financial Report Assistant
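Project 2 and Chapter 10 lean on Self-Instruct-style generation: seed instructions are fed to a model as in-context examples, and novel candidates are kept after a similarity filter. A minimal sketch of that generate-then-filter loop follows; the stubbed model call, helper names, and the 0.7 overlap threshold are assumptions for illustration, not the book's implementation:

```python
import random

SEED_TASKS = [
    "Summarize the key holding of the following court opinion.",
    "List the statutory elements of breach of contract.",
    "Explain the difference between civil and criminal liability.",
]

def word_overlap(a, b):
    """Jaccard overlap of word sets -- a crude stand-in for ROUGE filtering."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def generate_candidates(seeds, model, n=5, rng=None):
    """Self-Instruct loop (sketch): sample pool instructions as in-context
    examples, ask the model for a new instruction, keep only novel ones."""
    rng = rng or random.Random(0)
    pool = list(seeds)
    for _ in range(n):
        examples = rng.sample(pool, k=min(2, len(pool)))
        candidate = model("\n".join(examples))
        # drop candidates too similar to anything already in the pool
        if all(word_overlap(candidate, p) < 0.7 for p in pool):
            pool.append(candidate)
    return pool[len(seeds):]

# Stub standing in for a real LLM call -- swap in your API of choice.
fake_model = lambda prompt: "Draft a clause limiting liability in a service agreement."
new_instructions = generate_candidates(SEED_TASKS, fake_model)
```

In a real pipeline the generated instructions would then be paired with model-written responses and passed through quality and CoT-consistency checks before entering the SFT set.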

  • Data-Centric AI philosophy throughout
  • Covers the full LLM data lifecycle: Pre-training → Fine-tuning → RLHF → RAG
  • In-depth coverage of Scaling Laws, data quality evaluation, multimodal alignment, and more
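The Scaling Laws discussion above is easy to make concrete: the Chinchilla-style parametric fit models loss as a function of parameter count N and token count D. The coefficients below are the published Chinchilla fit (Hoffmann et al., 2022); treat the exact numbers as illustrative:

```python
# Chinchilla parametric loss: L(N, D) = E + A / N**alpha + B / D**beta
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params, n_tokens):
    """Estimated pre-training loss for n_params parameters, n_tokens tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# At a fixed model size, more (quality) data lowers the estimated loss:
l_small = loss(7e9, 140e9)   # 7B params, 140B tokens (~20 tokens/param)
l_more  = loss(7e9, 1.4e12)  # same model, 10x the tokens
```

The asymmetry between the N and D terms is the quantitative argument for data-centric work: past the compute-optimal point, extra tokens (and cleaner tokens) move the loss more cheaply than extra parameters.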
| Domain | Technologies |
| --- | --- |
| Distributed Computing | Ray Data, Spark, Dask |
| Data Storage | Parquet, WebDataset, Vector Databases (Milvus/Qdrant) |
| Text Processing | Trafilatura, KenLM, MinHash LSH, fastText Quality Scoring |
| Multimodal | CLIP, ColPali, img2dataset |
| Data Versioning | DVC, LakeFS, Pachyderm |
| Project | Core Technologies | Output |
| --- | --- | --- |
| Mini-C4 Pre-training Set | Trafilatura + Ray + MinHash | High-quality text corpus |
| Legal Expert SFT | Self-Instruct + CoT | Domain instruction dataset |
| LLaVA Multimodal | Bbox alignment + multi-image interleaving | Visual instruction dataset |
| Math Textbook | Evol-Instruct + sandbox verification | PoT reasoning dataset |
| Financial Report RAG | ColPali + Qwen-VL | Multimodal QA system |
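A building block the RAG-oriented chapters depend on is semantic chunking: splitting documents on sentence boundaries and packing sentences into overlapping, size-bounded chunks so retrieval context is not cut mid-thought. A minimal sketch under those assumptions (the function, word budget, and overlap policy are illustrative, not the book's code):

```python
import re

def chunk(text, max_words=100, overlap_sents=1):
    """Greedy sentence packing: split on sentence-ending punctuation, pack
    sentences up to max_words per chunk, and repeat the last overlap_sents
    sentences at the start of the next chunk for retrieval continuity."""
    sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, cur = [], []
    for s in sents:
        if cur and sum(len(x.split()) for x in cur) + len(s.split()) > max_words:
            chunks.append(" ".join(cur))
            cur = cur[-overlap_sents:]
        cur.append(s)
    if cur:
        chunks.append(" ".join(cur))
    return chunks

paragraphs = chunk(
    "One two three. Four five six. Seven eight nine. Ten eleven twelve.",
    max_words=6,
)
# -> three overlapping chunks, each repeating the previous chunk's last sentence
```

Production pipelines typically replace the regex splitter with layout-aware document parsing and pick chunk boundaries by embedding similarity rather than a fixed word budget, but the packing-with-overlap structure is the same.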

Setup & Usage

  • Python 3.8+
  • MkDocs Material
  • mkdocs-static-i18n (i18n support)
```bash
# Clone the repository
git clone https://github.com/datascale-ai/data_engineering_book.git
cd data_engineering_book

# Install dependencies
pip install mkdocs-material mkdocs-glightbox pymdown-extensions "mkdocs-static-i18n[material]"

# Local preview
mkdocs serve
```

Visit http://127.0.0.1:8000 to preview the book (with a Chinese/English/Japanese language switcher). To build the static site:

```bash
mkdocs build
```

The generated static files are located in the site/ directory.
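The language switcher is driven from mkdocs.yml via the mkdocs-static-i18n plugin. For a three-language, folder-per-locale layout like this one, the plugin configuration typically looks roughly like the fragment below (illustrative; not the repo's actual file):

```yaml
theme:
  name: material
plugins:
  - i18n:
      docs_structure: folder      # matches the docs/zh, docs/en, docs/ja layout
      languages:
        - locale: zh
          name: 简体中文
          default: true
        - locale: en
          name: English
        - locale: ja
          name: 日本語
```

With `docs_structure: folder`, each locale directory is built as a parallel copy of the site and Material's header exposes the switcher automatically.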

Directory Structure:

```
data_engineering_book/
├── docs/
│   ├── zh/                  # Chinese content
│   │   ├── index.md         # Chinese homepage
│   │   └── part1/ ~ part6/  # All chapters
│   ├── en/                  # English content
│   │   ├── index.md         # English homepage
│   │   └── part1/ ~ part6/  # All chapters
│   ├── ja/                  # Japanese content
│   │   ├── index.md         # Japanese homepage
│   │   └── part1/ ~ part6/  # All chapters
│   ├── images/              # Image assets (shared)
│   ├── stylesheets/         # Custom styles
│   └── javascripts/         # JavaScript (MathJax etc.)
├── .github/workflows/       # GitHub Actions CI/CD
├── images/                  # Project image assets
│   ├── structure_cn.png     # Book architecture diagram (Chinese)
│   └── structure_en.png     # Book architecture diagram (English)
├── mkdocs.yml               # MkDocs configuration
├── LICENSE                  # License
├── README.md                # Chinese README
└── README_en.md             # English README (this file)
```

Target Audience:

  • LLM R&D Engineers
  • Data Engineers / MLOps Engineers
  • AI Product Managers (Technical)
  • Researchers interested in LLM data pipelines

Professor Jun Yu's Team
Laboratory Information:
National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China;
Multimedia Computing and Intelligent Robotics Research Center, Department of Automation, University of Science and Technology of China;
Joint Research Center for Multi-Modal Intelligent Agents, Department of Automation, University of Science and Technology of China

Contributions are welcome! Feel free to submit Issues and Pull Requests.

  • Fork this repository
  • Create a feature branch (git checkout -b feature/AmazingFeature)
  • Commit your changes (git commit -m 'Add some AmazingFeature')
  • Push to the branch (git push origin feature/AmazingFeature)
  • Open a Pull Request

This project is licensed under the MIT License - see the LICENSE file for details.

AI-Generated Content

This content was automatically summarized, translated, and analyzed by AI from the original post on HN AI Engineering. Copyright remains with the original author; please refer to the original post for accurate details.
