How LLMs Work, Fully Dissected: How ChatGPT Is Made
Summary
Based on Andrej Karpathy's lecture, this article walks step by step through how large language models (LLMs) such as GPT are built. It covers the entire pipeline, from collecting and cleaning internet data through tokenization to training with the Transformer architecture. The central point is that the quantity and quality of the data matter most, and that by learning the statistical patterns of next-token prediction the model picks up the grammar, facts, and even reasoning patterns of human language.
Key Points
- The foundation of an LLM is web-crawl data such as Common Crawl, which is passed through URL filtering, deduplication, and other cleaning steps to produce a 44 TB high-quality corpus (the FineWeb dataset).
- Text is broken into sub-word tokens with the Byte Pair Encoding (BPE) algorithm, which handles the endless variant forms of words and new terms efficiently.
- Through the Transformer architecture, an LLM uses embedding vectors to capture contextual meaning; at its core it is a statistical process of predicting the next token.
- At inference time, the trained model generates text autoregressively from a probability distribution, with parameters such as temperature controlling the randomness.
A complete walkthrough of how large language models like ChatGPT are built — from raw internet text to a conversational assistant. Based on Andrej Karpathy's technical deep dive.
Training Tokens: 15T
Parameters: 405B
Text Data: 44 TB
Token Vocabulary: 100K
Representative figures from frontier models circa 2024 — exact numbers shift with every release. The scale is the point, not the precision.
Chapter 1 · Pre-Training · Stage 1
Downloading the Internet
The first step is collecting an enormous amount of text. Organizations like Common Crawl have been crawling the web since 2007 — indexing 2.7 billion pages by 2024. This raw data is then filtered into a high-quality dataset like FineWeb.
The goal: large quantity of high quality, diverse documents. After aggressive filtering, you end up with about 44 terabytes — roughly 10 consumer hard drives worth of text — representing ~15 trillion tokens.
Key Insight: The quality and diversity of this training data has more impact on the final model than almost anything else. Garbage in, garbage out — but at a trillion-token scale.
🌐 Common Crawl
A non-profit organization that crawls the web and freely provides its data. Their bots follow links from seed pages, recursively indexing the internet. The raw archive is petabytes of gzip'd WARC files containing raw HTML.
🚫 URL Filtering
Blocklists · Malware · Spam · Adult content
Block-lists of known malware sites, spam networks, adult content, marketing pages, and low-quality domains are applied. Entire domains can be removed. This is the cheapest filter so it runs first.
📄 Text Extraction
HTML → clean text · Remove navigation & CSS
Raw HTML contains <div> tags, CSS, JavaScript, navigation menus, and ads. Parsers extract just the meaningful text content. This is harder than it sounds — heuristics decide what's "content" vs "chrome".
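A minimal sketch of this step, assuming BeautifulSoup as the HTML parser; production pipelines layer much heavier content-vs-chrome heuristics on top of something like this:

# Toy extraction: drop scripts, styles, and navigation, keep the visible text.
from bs4 import BeautifulSoup

def extract_text(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()                           # remove non-content elements
    return soup.get_text(separator="\n", strip=True)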
🌍 Language Filtering
A language classifier estimates the language of each page. Pages scoring below roughly 65% confidence in the target language are dropped. This is a design decision — filter aggressively for one language or train multilingual.
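A sketch of this filter, assuming the pretrained fastText language-ID model (lid.176.bin) as the classifier; the 0.65 threshold mirrors the figure above:

# Keep a page only if the classifier is confident it is in the target language.
import fasttext

lang_model = fasttext.load_model("lid.176.bin")    # pretrained language-ID model

def keep_page(text: str, target: str = "__label__en", threshold: float = 0.65) -> bool:
    labels, probs = lang_model.predict(text.replace("\n", " "))   # predict() rejects newlines
    return labels[0] == target and probs[0] >= threshold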
♻️ Deduplication
Exact & fuzzy matching · Reduce repetition
Identical or near-identical pages appear millions of times on the internet (copied articles, boilerplate). Training on the same text repeatedly causes memorization. Dedup uses MinHash and exact-match techniques to remove duplicates.
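A toy stand-in for fuzzy dedup: exact Jaccard overlap of word shingles, which MinHash signatures approximate cheaply at trillion-token scale:

# Toy near-duplicate check: Jaccard similarity over word 5-grams (shingles).
def shingles(text: str, n: int = 5) -> set:
    words = text.lower().split()
    return {" ".join(words[i : i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: str, b: str) -> float:
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / max(1, len(sa | sb))

doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the river bank again"
print(jaccard(doc_a, doc_b) > 0.8)   # True: near-duplicates above a threshold get dropped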
🔒 PII Removal
Names · Addresses · SSNs · Emails
Personally Identifiable Information is detected and either redacted or the page is dropped. Regex patterns and ML classifiers find phone numbers, emails, Social Security numbers, physical addresses, and named individuals.
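A minimal sketch of the regex side of PII detection; real pipelines combine patterns like these with ML classifiers for names and addresses:

# Toy PII scrubber: redact emails, US phone numbers, and SSN-shaped strings.
import re

PII_PATTERNS = {
    "[EMAIL]": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "[PHONE]": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "[SSN]":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    for placeholder, pattern in PII_PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text

print(redact("Reach me at jane@example.com or 555-867-5309."))   # Reach me at [EMAIL] or [PHONE].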
✅ FineWeb Dataset: 44 TB · 15 Trillion tokens · High quality
The final filtered dataset. Articles about tornadoes in 2012, medical facts, history, code, recipes, science papers — the full breadth of human knowledge expressed in text. This becomes the training corpus.
Chapter 1 · Pre-Training · Stage 2
Tokenization
Neural networks can't process raw text — they need numbers. The solution is tokenization: breaking text into "tokens" (sub-word chunks) and assigning each an ID.
GPT-4 uses a vocabulary of 100,277 tokens, built via the Byte Pair Encoding (BPE) algorithm. BPE starts with individual bytes (256 symbols), then iteratively merges the most frequent adjacent pairs — compressing the sequence length while expanding the vocabulary.
Why not just use words?
Words have infinite variants. "run", "running", "runner" would be 3 separate entries. Subword tokens share roots: "run" + "ning", "run" + "ner". This also handles new words, typos, and multiple languages efficiently.
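A minimal sketch of the byte-level BPE training loop described above; real tokenizers (e.g. the one behind GPT-4's 100,277-token vocabulary) run hundreds of thousands of merges over huge corpora:

# Toy BPE: repeatedly merge the most frequent adjacent pair of symbols.
from collections import Counter

def bpe_train(text: str, num_merges: int):
    seq = list(text.encode("utf-8"))           # start from raw bytes (ids 0..255)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))     # count adjacent pairs
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)       # most frequent adjacent pair
        new_id = 256 + len(merges)             # new token id beyond the byte range
        merges.append((pair, new_id))
        out, i = [], 0                         # replace every occurrence with the new id
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new_id); i += 2
            else:
                out.append(seq[i]); i += 1
        seq = out
    return seq, merges

tokens, merges = bpe_train("running runner run run running", 10)
print(len(tokens), "tokens after", len(merges), "merges")   # the sequence shrinks as the vocab grows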
Chapter 1 · Pre-Training · Stage 3
Neural Network Training
The Transformer neural network is initialized with random parameters — billions of "knobs". Training adjusts these knobs so the network gets better at predicting the next token in any sequence.
Every training step: sample a window of tokens → feed to network → compare prediction to actual next token → nudge all parameters slightly in the right direction. Repeat billions of times.
The loss — a single number measuring prediction error — falls steadily as the model learns the statistical patterns of human language.
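A schematic of that training step in PyTorch-style code; the model, corpus, and sizes are placeholders, not the actual frontier-scale setup:

# Schematic next-token training step: sample windows, predict every next token,
# measure the error, nudge all parameters.
import torch
import torch.nn.functional as F

def train_step(model, optimizer, corpus_tokens, context_len=1024, batch_size=16):
    starts = torch.randint(0, len(corpus_tokens) - context_len - 1, (batch_size,))
    x = torch.stack([corpus_tokens[i : i + context_len] for i in starts.tolist()])        # inputs
    y = torch.stack([corpus_tokens[i + 1 : i + context_len + 1] for i in starts.tolist()])  # targets, shifted by one

    logits = model(x)                                   # (batch, context, vocab) scores for the next token
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))

    optimizer.zero_grad()
    loss.backward()      # how should each of the billions of "knobs" move?
    optimizer.step()     # nudge them all slightly in that direction
    return loss.item()   # the single number that falls as the model learns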
Scale: GPT-2 (2019): 1.6B params, 100B tokens, ~$40K to train. Today: same quality for ~$100. Llama 3: 405B params, 15T tokens. Modern frontier models: hundreds of billions of parameters, trillions of tokens.
Transformer Architecture
What is an Embedding?
Each token ID maps to a learned vector of ~1,000–4,000 numbers called its embedding. Think of it as a coordinate in meaning-space — initialized randomly, then shaped by training. The same token (e.g. "bank") always enters the network with the same embedding vector. Attention layers then mix in context from surrounding tokens, so by the time "bank" reaches deeper layers, "river bank" and "bank account" carry completely different representations. Polysemy is resolved by context, not by storing multiple meanings per token.
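A small PyTorch sketch of the embedding lookup; the token id and dimensions are illustrative, and the attention layers that later mix in context are elided:

# Each token id indexes one row of a learned table: its embedding vector.
import torch
import torch.nn as nn

vocab_size, d_model = 100_277, 4096          # illustrative sizes
embed = nn.Embedding(vocab_size, d_model)    # random at init, shaped by training

token_ids = torch.tensor([5432, 17, 5432])   # suppose id 5432 is "bank" (hypothetical id)
vectors = embed(token_ids)                   # shape (3, 4096)

print(torch.equal(vectors[0], vectors[2]))   # True: same token, same entry vector
# Only after attention mixes in surrounding context do "river bank" and
# "bank account" end up with different internal representations.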
Model Output at This Stage
The model has learning but confusion still. The model bnsto predict...
What the model is learning: At step 1: pure noise. By step 500: local coherence appears. By step 32K: fluent English. The model is learning grammar, facts, reasoning patterns — all implicitly from token prediction.
Chapter 1 · Pre-Training · Stage 4
Inference & Token Sampling
Once trained, the network generates text autoregressively: feed a sequence of tokens → get a probability distribution over all 100K possible next tokens → sample one → append → repeat.
This process is stochastic — the same prompt generates different outputs every time because we're flipping a biased coin. Higher-probability tokens are more likely but not guaranteed to be chosen.
Temperature controls randomness. Low temperature (0.1) → model always picks the top token. High temperature (2.0) → uniform chaos. 0.7–1.0 is the sweet spot for coherent-but-creative text.
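A minimal sketch of temperature-scaled sampling; the model and prompt handling are placeholders:

# Sample the next token: scale logits by 1/temperature, softmax, then draw.
import torch
import torch.nn.functional as F

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8) -> int:
    probs = F.softmax(logits / temperature, dim=-1)        # low T sharpens, high T flattens
    return torch.multinomial(probs, num_samples=1).item()  # the informed coin flip

# Autoregressive loop (model stands in for the trained network):
# tokens = list(prompt_tokens)
# for _ in range(max_new_tokens):
#     logits = model(torch.tensor([tokens]))[0, -1]   # scores over all ~100K next tokens
#     tokens.append(sample_next_token(logits))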
Key Mental Model: The model doesn't "think" about what to say. It computes a probability distribution over all possible next tokens and samples from it. Every word is a coin flip — just a very informed one.
Chapter 2 · The Base Model
The Internet Simulator
After pre-training, you have a base model — a sophisticated autocomplete engine. It's not an assistant. It doesn't answer questions. It continues token sequences based on what it saw on the internet.
Give it a Wikipedia sentence and it'll complete it from memory. Ask it "What is 2+2?" and it might give you a math textbook page, a quiz answer key, or go off on a tangent — whatever was statistically common in its training data.
The base model's knowledge lives in its 405 billion parameters — a lossy compression of the internet, like a zip file that approximates rather than perfectly stores information.
Example: Zebras (/ˈzɛbrə, ˈziːbrə/) are African equines with distinctive... ...black-and-white striped coats. There are three living species: the Grévy's zebra, plains zebra, and mountain zebra... ↑ Verbatim Wikipedia recall from weights
Example: The Republican Party nominated Trump and [running mate] in the 2024 election against... → ...Mike Pence, facing Hillary Clinton and Tim Kaine... → ...Ron DeSantis, against Joe Biden and Kamala Harris... ↑ Knowledge cutoff → plausible confabulation
In-Context Learning: Base models can perform translation, classification, and Q&A via few-shot prompts — no fine-tuning needed. The model infers the task from the pattern of examples in its context window.
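For example, a base model can be steered into translation with a few-shot prompt; the pairs below are illustrative (in the spirit of the GPT-3 paper's demos):

prompt = (
    "sea otter -> loutre de mer\n"
    "cheese -> fromage\n"
    "plush giraffe ->"
)
# Fed this context, the statistically likely continuation is the French
# translation of the last line (roughly "girafe en peluche"), even though
# the base model was never explicitly trained to translate.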
Chapter 3 · Post-Training
Building the Assistant
The base model is a token simulator. To turn it into a helpful assistant, we need post-training — a much cheaper but equally critical stage. This is where the model learns conversations.
Supervised Fine-Tuning (SFT)
Human labelers create a dataset of ideal conversations, following detailed labeling instructions: be helpful, be truthful, be harmless. The model is then trained on these conversations — not from scratch, but by continuing to adjust the pre-trained weights on this new data.
Modern SFT datasets (like UltraChat) have millions of conversations — mostly synthetic (LLM-generated), with human review. The model learns by imitation: it adopts the persona of the ideal assistant reflected in the data.
Training Conversation Example:
Human: What is 2 + 2?
Assistant: 2 + 2 = 4. Is there anything else you'd like help with?
Human: What if it was multiplication instead?
Assistant: 2 × 2 = 4 as well — the same result! For multiplication, 2 × 2 means adding 2 to itself once, giving you 4.
What you're really talking to: ChatGPT is a statistical simulation of the human labelers OpenAI hired — experts following labeling instructions. When it answers a coding question, it's imitating what a skilled developer-labeler would write.
Conversation Token Format: Every conversation must be encoded as a flat token sequence. Special tokens mark the structure:
<|im_start|>user<|im_sep|>What is 2 + 2?<|im_end|><|im_start|>assistant<|im_sep|>2 + 2 = 4.<|im_end|>
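A sketch of how a conversation might be flattened into that format; the special-token names follow the example above, though exact templates vary by model:

# Flatten a list of chat turns into one token-ready string with special markers.
def render_conversation(turns):
    # turns: list of (role, text) pairs, e.g. [("user", "What is 2 + 2?")]
    parts = [f"<|im_start|>{role}<|im_sep|>{text}<|im_end|>" for role, text in turns]
    parts.append("<|im_start|>assistant<|im_sep|>")   # left open so the model writes the reply
    return "".join(parts)

print(render_conversation([("user", "What is 2 + 2?")]))
# <|im_start|>user<|im_sep|>What is 2 + 2?<|im_end|><|im_start|>assistant<|im_sep|>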
Then RLHF refines the assistant's behavior further:
RLHF — Reinforcement Learning from Human Feedback
Human raters rank multiple model responses. A reward model learns to predict human preferences. The language model is then trained via reinforcement learning to generate responses the reward model scores highly.
Example:
✓ Preferred: Here are the top 5 landmarks in Paris: 1) Eiffel Tower — iconic iron lattice structure... 2) The Louvre — world's largest art museum...
✗ Rejected: Paris has many landmarks. You should visit the Eiffel Tower. There is also a museum called the Louvre. Also Notre-Dame Cathedral is there...
Why RLHF matters: SFT teaches the model what to say. RLHF teaches it how to say it well — making responses more helpful, better structured, more honest, and less likely to hallucinate.
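A minimal sketch of the training signal behind the reward model, assuming a standard pairwise (Bradley-Terry-style) preference loss over preferred vs. rejected responses:

# The reward model should score the preferred response above the rejected one.
# Loss = -log(sigmoid(score_preferred - score_rejected)).
import torch
import torch.nn.functional as F

def preference_loss(score_preferred, score_rejected):
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# e.g. scores the reward model assigned to the two Paris answers above
print(preference_loss(torch.tensor([2.3]), torch.tensor([0.4])))  # small: ranking already correct
print(preference_loss(torch.tensor([0.4]), torch.tensor([2.3])))  # large: ranking flipped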
Chapter 4 · LLM Psychology
Cognitive Quirks of Language Models
Understanding why LLMs behave the way they do requires thinking about their psychology — the emergent properties of being trained to statistically imitate human text.
- 🌀 Hallucination: Models confabulate confidently because training data always has confident answers. "Who is Orson Kovats?" gets a made-up biography because the training distribution of "who is X?" questions is always followed by confident replies — even for fictional names. Fix: add "I don't know" examples for questions the model gets wrong consistently.
- 🧠 Two Types of Memory: Parameters = long-term memory. Everything the model learned during training — vast but vague, like something you read months ago. Context window = working memory. Text in the current conversation — precise, directly accessible. Always paste important info into context rather than relying on the model to "remember".
- 🔧 Tool Use: Models can emit special tokens that trigger external tools, e.g. <search>query</search>. The program pauses generation, executes the search, stuffs the results into the context window, then resumes. The model "looks things up" the same way you do — by refreshing working memory. A sketch of this loop appears after this list.
- 🪞 No Persistent Self: Each conversation starts fresh — no memory of prior chats. The model "boots up," processes tokens, then shuts off. It has no stable identity. When it says "I'm ChatGPT by OpenAI," that's just the most statistically likely answer from training data — not genuine self-knowledge.
- 📊 Stochastic Token Tumbler: The model doesn't "decide" what to say. It computes probability distributions and samples. Run the same prompt 10 times and get 10 different outputs, each one a fresh draw from the same distribution.
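A sketch of the tool-use loop from the list above, assuming hypothetical <search> tags, a model.generate() stand-in, and a run_search helper:

# Generate until the model emits a tool call, run the tool, feed results back.
import re

def generate_with_search(model, prompt, run_search, max_rounds=3):
    context = prompt
    for _ in range(max_rounds):
        text = model.generate(context)                  # hypothetical generate() API
        call = re.search(r"<search>(.*?)</search>", text)
        if not call:
            return context + text                       # no tool call: generation is done
        results = run_search(call.group(1))             # pause and execute the search
        # stuff the results into the context window ("working memory"), then resume
        context = context + text[: call.end()] + f"<result>{results}</result>"
    return context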