Lilian · Headline · 2026. 05. 07. 05:57

The Transformer Family

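Summary

This technical article explains the core architecture of the Transformer model. The encoder produces an attention-based representation that can locate a specific piece of information in a large context, and the decoder retrieves information from that encoded representation. Because self-attention carries no order information by itself, the article also stresses the importance of positional encoding and introduces two variants: sinusoidal and learned.

Key points

The Transformer encoder combines multi-head self-attention with a point-wise feed-forward network to extract information from the context.
The decoder uses masked multi-head attention to retrieve information sequentially from the encoded representation.
Self-attention has no notion of order, so positional encoding must supply the token order to the model.
Auxiliary losses can improve the ability to train deep Transformer models.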

Updated on 2023-01-27: After almost three years, this post received a big refactoring update to incorporate new Transformer models released since 2020. The enhanced version of this post is The Transformer Family Version 2.0. Please refer to that post on this topic.

The encoder generates an attention-based representation with the capability to locate a specific piece of information in a large context. It consists of a stack of 6 identical modules, each containing two submodules: a multi-head self-attention layer and a point-wise fully connected feed-forward network. Point-wise means that the same linear transformation (with the same weights) is applied to each element in the sequence. This can also be viewed as a convolutional layer with filter size 1. Each submodule has a residual connection and layer normalization. All the submodules output data of the same dimension $d$.
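As a rough illustration of what "point-wise" means, the sketch below applies one shared two-layer network to every position and then adds the residual connection and layer normalization. It uses plain Python lists to stay dependency-free. The ReLU activation, the post-norm ordering, the omission of the learned gain and bias in layer normalization, and all names and toy weights are assumptions made for this example.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one d-dimensional vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(x, W, b):
    """Compute x @ W + b for a vector x and a matrix W stored as a list of rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def pointwise_ffn(seq, W1, b1, W2, b2):
    """Apply the same two-layer network to each position, then residual + layer norm."""
    out = []
    for x in seq:  # every position is processed independently with shared weights
        hidden = [max(0.0, h) for h in linear(x, W1, b1)]  # ReLU (assumed activation)
        y = linear(hidden, W2, b2)
        out.append(layer_norm([xi + yi for xi, yi in zip(x, y)]))
    return out

# Toy weights, chosen only so the example is deterministic.
d, d_ff = 4, 8
W1 = [[0.1 * (i + j) for j in range(d_ff)] for i in range(d)]
b1 = [0.0] * d_ff
W2 = [[0.05 * (i - j) for j in range(d)] for i in range(d_ff)]
b2 = [0.0] * d

# Positions 0 and 2 hold the same vector on purpose.
seq = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0], [1.0, 2.0, 3.0, 4.0]]
out = pointwise_ffn(seq, W1, b1, W2, b2)
```

Because the weights are shared and no position looks at any other, the two identical input vectors produce identical outputs; mixing information across positions is left entirely to the attention submodule.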

The function of the Transformer decoder is to retrieve information from the encoded representation. The architecture is quite similar to the encoder, except that each identical repeating module of the decoder contains two multi-head attention submodules instead of one. The first multi-head attention submodule is masked to prevent positions from attending to the future.
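The masking can be sketched as a row-wise softmax in which position $i$ only sees scores for positions $j \le i$. This is a minimal single-head illustration on a made-up score matrix, not the full decoder; the function name and the toy numbers are assumptions.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], with future positions j > i masked out."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]            # position i may attend to j <= i only
        m = max(visible)                  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Masked (future) positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

scores = [[0.0, 1.0, 2.0],
          [1.0, 1.0, 5.0],
          [0.5, 0.5, 0.5]]
weights = causal_attention_weights(scores)
```

In the second row the large score 5.0 belongs to a future position, so it is ignored and the two visible positions share the weight equally.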

Positional Encoding

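Because the self-attention operation is permutation invariant, it is important to use a proper positional encoding to provide order information to the model. The positional encoding $\mathbf{P} \in \mathbb{R}^{L \times d}$ has the same dimension as the input embedding, so it can be added to the input directly. The vanilla Transformer considered two types of encodings:

(1) Sinusoidal positional encoding is defined as follows, given the token position $i=1,\dots,L$ and the dimension $\delta=1,\dots,d$:

$$
\begin{align}
\text{PE}(i, \delta) =
\begin{cases}
\sin\left(\frac{i}{10000^{2\delta'/d}}\right) & \text{if } \delta = 2\delta' \\
\cos\left(\frac{i}{10000^{2\delta'/d}}\right) & \text{if } \delta = 2\delta' + 1
\end{cases}
\end{align}
$$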

In this way, each dimension of the positional encoding corresponds to a sinusoid, and the wavelengths vary across dimensions from $2\pi$ to $10000 \cdot 2\pi$.
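A minimal sketch of the sinusoidal scheme is shown below: even dimensions use a sine, odd dimensions use a cosine, and each sine/cosine pair shares the frequency $1/10000^{2\delta'/d}$. The 0-based indexing of positions and dimensions, and the function name, are assumptions of this example.

```python
import math

def sinusoidal_positional_encoding(L, d):
    """Return an L x d table: sine on even dimensions, cosine on odd dimensions."""
    table = []
    for i in range(L):                # token position (0-based in this sketch)
        row = []
        for delta in range(d):        # embedding dimension (0-based in this sketch)
            pair = delta // 2         # delta' in delta = 2*delta' or 2*delta' + 1
            angle = i / (10000 ** (2 * pair / d))
            row.append(math.sin(angle) if delta % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

P = sinusoidal_positional_encoding(L=50, d=8)
```

The table has the same shape as a sequence of $L$ input embeddings of size $d$, so it can be added to them element by element.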

(2) Learned positional encoding, as its name suggests, assigns each element a learned column vector which encodes its absolute position (Gehring, et al. 2017).

Quick Follow-ups

Following the vanilla Transformer, Al-Rfou et al. (2018) added a set of auxiliary losses to enable training a deep Transformer model on character-level language modeling which outperformed LSTMs. Several types of auxiliary tasks are used:

Instead of producing only one prediction at the sequence end, every intermediate position is also asked to make a correct prediction, forcing the model to predict given smaller contexts (e.g. the first couple of tokens at the beginning of a context window).
Each intermediate Transformer layer is used for making predictions as well. Lower layers are weighted to contribute less and less to the total loss as training progresses.
Each position in the sequence can predict multiple targets, i.e. two or more predictions of the future tokens.
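The first two auxiliary tasks can be sketched as simple loss bookkeeping. The snippet below is only an illustration: the toy probabilities, the per-layer loss values and the layer weights are made up, and the weighting schedule is a hypothetical placeholder rather than the one used by Al-Rfou et al. The multiple-targets task is not covered.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target index under a probability vector."""
    return -math.log(probs[target])

def every_position_loss(pred_probs, targets):
    """Average the prediction loss over all positions, not only the last one."""
    losses = [cross_entropy(p, t) for p, t in zip(pred_probs, targets)]
    return sum(losses) / len(losses)

def layerwise_loss(per_layer_losses, layer_weights):
    """Weighted sum of losses computed from each layer's own predictions.
    layer_weights is a hypothetical schedule supplied by the caller."""
    return sum(w * l for w, l in zip(layer_weights, per_layer_losses))

# Toy example: 3 positions over a 2-symbol vocabulary.
pred_probs = [[0.5, 0.5], [0.25, 0.75], [0.9, 0.1]]
targets = [0, 1, 0]
position_loss = every_position_loss(pred_probs, targets)
last_only_loss = cross_entropy(pred_probs[-1], targets[-1])

# Hypothetical weights: lower layers count less than the top layer.
total = layerwise_loss([1.2, 0.8, position_loss], [0.1, 0.3, 1.0])
```

In this toy case the early positions, which see less context, are predicted less confidently, so the all-positions loss is larger than the loss at the final position alone.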

Adaptive Computation Time (ACT)

Adaptive Computation Time (short for ACT; Graves, 2016) is a mechanism for dynamically deciding how many computational steps are needed in a recurrent neural network. Here is a cool tutorial on ACT from distill.pub.

Let's say we have an RNN model $\mathcal{R}$ composed of input weights $W_x$, a parametric state transition function $\mathcal{S}(.)$, a set of output weights $W_y$ and an output bias $b_y$. Given an input sequence $(x_1,\dots,x_L)$, the output sequence $(y_1,\dots,y_L)$ is computed by:

$$
\begin{align}
s_t = \mathcal{S}(s_{t-1}, W_x x_t), \quad y_t = W_y s_t + b_y
\end{align}
$$

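ACT enables the above RNN setup to perform a variable number of steps at each input element. Multiple computational steps lead to a sequence of intermediate states $(s_t^1, \dots, s_t^{N(t)})$ and outputs $(y_t^1, \dots, y_t^{N(t)})$, which all share the same state transition function $\mathcal{S}(.)$, as well as the same output weights $W_y$ and bias $b_y$:

$$
\begin{align}
s_t^0 &= s_{t-1} \\
s_t^n &= \mathcal{S}(s_t^{n-1}, x_t^n) = \mathcal{S}(s_t^{n-1}, x_t + \delta_{n,1}) \text{ for } n = 1, \dots, N(t) \\
y_t^n &= W_y s_t^n + b_y
\end{align}
$$

where $\delta_{n,1}$ is a binary flag indicating whether the input step has been incremented.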

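The number of steps $N(t)$ is determined by an extra sigmoidal halting unit $h$, with associated weight matrix $W_h$ and bias $b_h$, which yields a halting probability $p_t^n$ at intermediate step $n$ for the $t$-th input element:

$$
\begin{align}
h_t^n = \sigma(W_h s_t^n + b_h)
\end{align}
$$

In order to allow the computation to halt after a single step, ACT introduces a small constant $\epsilon$ (e.g. 0.01), so that the computation stops as soon as the cumulative probability reaches $1-\epsilon$.

$$
\begin{align}
N(t) &= \min\left(\min\left\{n' : \sum_{n=1}^{n'} h_t^n \geq 1 - \epsilon\right\}, M\right) \\
R(t) &= 1 - \sum_{n=1}^{N(t)-1} h_t^n \\
p_t^n &= \begin{cases} R(t) & \text{if } n = N(t) \\ h_t^n & \text{otherwise} \end{cases}
\end{align}
$$

where $M$ is an upper limit on the number of intermediate steps allowed.

The final state and output are mean-field updates:

$$
\begin{align}
s_t = \sum_{n=1}^{N(t)} p_t^n s_t^n, \quad y_t = \sum_{n=1}^{N(t)} p_t^n y_t^n
\end{align}
$$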

To avoid unnecessary pondering for each input, ACT adds a ponder cost $\mathcal{P}(x) = \sum_{t=1}^L \left(N(t) + R(t)\right)$ to the loss function to encourage a smaller number of intermediate computational steps.
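The halting logic for a single input element can be sketched as follows, following the rule in Graves (2016): $N(t)$ is the first step at which the accumulated halting-unit outputs reach $1-\epsilon$ (capped at $M$), the last step receives the remainder $R(t)$ as its probability, and the final state is the probability-weighted mean of the intermediate states. The halting-unit outputs below are made-up numbers rather than outputs of a trained unit, and the function names are assumptions.

```python
def act_halting(h, eps=0.01, M=None):
    """Given halting-unit outputs h[0], h[1], ... for one input element,
    return (N, R, p): step count, remainder and halting probabilities."""
    M = len(h) if M is None else min(M, len(h))
    cumulative = 0.0
    N = M                                  # fall back to the cap if never reached
    for n in range(1, M + 1):
        cumulative += h[n - 1]
        if cumulative >= 1.0 - eps:        # first step that reaches 1 - eps
            N = n
            break
    R = 1.0 - sum(h[: N - 1])              # remainder assigned to the last step
    p = h[: N - 1] + [R]                   # probabilities sum to 1 by construction
    return N, R, p

def mean_field(p, values):
    """Probability-weighted average of intermediate states or outputs."""
    dim = len(values[0])
    return [sum(p_n * v[k] for p_n, v in zip(p, values)) for k in range(dim)]

# Toy halting outputs for one input element.
N, R, p = act_halting([0.1, 0.3, 0.7, 0.9])
ponder = N + R                              # this element's share of the ponder cost
state = mean_field(p, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# With eps, a confident first step can halt immediately.
N1, R1, p1 = act_halting([0.995])
```

In the first call the accumulated values are 0.1, 0.4 and 1.1, so the computation halts at the third step with remainder 0.6. Without $\epsilon$, a sigmoid output strictly below 1 could never halt after a single step.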

Improved Attention Span

The goal of improving the attention span is to make the context that can be used in self-attention longer, more efficient and more flexible.

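Longer Attention Span (Transformer-XL)

The vanilla Transformer has a fixed and limited attention span. The model can only attend to other elements in the same segment during each update step, and no information can flow across separated fixed-length segments.

This context segmentation causes several issues:

The model cannot capture very long-term dependencies.
It is hard to predict the first few tokens in each segment given no or thin context.
The evaluation is expensive. Whenever the segment is shifted to the right by one, the new segment is re-processed from scratch, although there are a lot of overlapping tokens.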

Transformer-XL (Dai et al., 2019; "XL" means "extra long") solves the context segmentation problem with two main modifications:

Reusing hidden states between segments.
Adopting a new positional encoding that is suitable for reused states.

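Hidden State Reuse

The recurrent connection between segments is introduced into the model by continuously using the hidden states from the previous segments.
Let's label the hidden state of the $n$-th layer for the $(\tau + 1)$-th segment in the model as $\mathbf{h}_{\tau+1}^{(n)} \in \mathbb{R}^{L \times d}$. In addition to the hidden state of the previous layer for the same segment, $\mathbf{h}_{\tau+1}^{(n-1)}$, it also depends on the hidden state of that previous layer for the previous segment, $\mathbf{h}_{\tau}^{(n-1)}$. By incorporating information from the previous hidden states, the model extends the attention span much further into the past, over multiple segments.

$$
\begin{align}
\widetilde{\mathbf{h}}_{\tau+1}^{(n-1)} &= [\text{stop-gradient}(\mathbf{h}_{\tau}^{(n-1)}) \circ \mathbf{h}_{\tau+1}^{(n-1)}] \\
\mathbf{Q}_{\tau+1}^{(n)} &= \mathbf{h}_{\tau+1}^{(n-1)} \mathbf{W}^q \\
\mathbf{K}_{\tau+1}^{(n)} &= \widetilde{\mathbf{h}}_{\tau+1}^{(n-1)} \mathbf{W}^k \\
\mathbf{V}_{\tau+1}^{(n)} &= \widetilde{\mathbf{h}}_{\tau+1}^{(n-1)} \mathbf{W}^v \\
\mathbf{h}_{\tau+1}^{(n)} &= \text{transformer-layer}(\mathbf{Q}_{\tau+1}^{(n)}, \mathbf{K}_{\tau+1}^{(n)}, \mathbf{V}_{\tau+1}^{(n)})
\end{align}
$$

The keys and values rely on the extended hidden state, while the queries only consume the hidden state at the current step. The concatenation operation $[. \circ .]$ is along the sequence length dimension.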
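The asymmetry between queries and keys/values can be sketched with plain lists: the cached states of the previous segment are concatenated with the current segment along the sequence dimension before the key and value projections, while the query projection sees the current segment only. The function names, the identity weights and the toy states are assumptions of this example, and the stop-gradient on the cached memory is only noted in a comment because plain lists carry no gradients.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def xl_qkv(h_prev_segment, h_current, Wq, Wk, Wv):
    """Build Q from the current segment only, and K, V from memory + current segment."""
    # Concatenate along the sequence-length dimension. In a real framework the
    # cached memory would be wrapped in a stop-gradient before this step.
    h_extended = h_prev_segment + h_current
    Q = matmul(h_current, Wq)
    K = matmul(h_extended, Wk)
    V = matmul(h_extended, Wv)
    return Q, K, V

# Toy setup: segment length L = 2, model size d = 2, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
h_prev = [[0.1, 0.2], [0.3, 0.4]]   # cached states from segment tau
h_curr = [[0.5, 0.6], [0.7, 0.8]]   # states of segment tau + 1
Q, K, V = xl_qkv(h_prev, h_curr, identity, identity, identity)
```

Each of the 2 queries can now score against 4 keys, so the effective context grows beyond the current segment without re-processing the earlier tokens.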

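Relative Positional Encoding

To work with this new form of attention span, Transformer-XL proposed a new type of positional encoding. If it used the same approach as the vanilla Transformer and encoded the absolute position, the previous and current segments would be assigned the same encoding, which is undesirable.

To keep the positional information flow coherent across segments, Transformer-XL encodes the relative position instead, as it could be sufficient to know the position offset $i-j$ between one key vector $\mathbf{k}_{\tau, j}$ and its query $\mathbf{q}_{\tau, i}$ for making good predictions.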

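If we omit the scalar $1/\sqrt{d_k}$ and the normalizing term in softmax but include the positional encodings, we can write the attention score between the query at position $i$ and the key at position $j$ as:

$$
\begin{align}
a_{ij} &= \mathbf{q}_i {\mathbf{k}_j}^\top = (\mathbf{x}_i + \mathbf{p}_i)\mathbf{W}^q \left((\mathbf{x}_j + \mathbf{p}_j)\mathbf{W}^k\right)^\top \\
&= \mathbf{x}_i\mathbf{W}^q {\mathbf{W}^k}^\top \mathbf{x}_j^\top + \mathbf{x}_i\mathbf{W}^q {\mathbf{W}^k}^\top \mathbf{p}_j^\top + \mathbf{p}_i\mathbf{W}^q {\mathbf{W}^k}^\top \mathbf{x}_j^\top + \mathbf{p}_i\mathbf{W}^q {\mathbf{W}^k}^\top \mathbf{p}_j^\top
\end{align}
$$

Transformer-XL reparameterizes the above four terms as follows:

$$
\begin{align}
a_{ij}^\text{rel} =
\underbrace{\mathbf{x}_i\mathbf{W}^q {\mathbf{W}_E^k}^\top \mathbf{x}_j^\top}_\text{content-based addressing} +
\underbrace{\mathbf{x}_i\mathbf{W}^q {\mathbf{W}_R^k}^\top \mathbf{r}_{i-j}^\top}_\text{content-dependent positional bias} +
\underbrace{\mathbf{u} {\mathbf{W}_E^k}^\top \mathbf{x}_j^\top}_\text{global content bias} +
\underbrace{\mathbf{v} {\mathbf{W}_R^k}^\top \mathbf{r}_{i-j}^\top}_\text{global positional bias}
\end{align}
$$

Replace $\mathbf{p}_j$ with the relative positional encoding $\mathbf{r}_{i-j} \in \mathbb{R}^{d}$;
Replace $\mathbf{p}_i\mathbf{W}^q$ with two trainable parameters, $\mathbf{u}$ (for content) and $\mathbf{v}$ (for location), in two different terms;
Split $\mathbf{W}^k$ into two matrices, $\mathbf{W}^k_E$ for content information and $\mathbf{W}^k_R$ for location information.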
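The reparameterized score can be sketched as four dot products: a content term and a position term driven by the query $\mathbf{x}_i\mathbf{W}^q$, plus a global content bias driven by $\mathbf{u}$ and a global positional bias driven by $\mathbf{v}$. The helper names, the identity weight matrices and the toy vectors below are assumptions chosen to keep the arithmetic easy to follow.

```python
def dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def vecmat(x, W):
    """Row vector times matrix (x @ W), with W stored as a list of rows."""
    return [dot(x, col) for col in zip(*W)]

def rel_attention_score(x_i, x_j, r_ij, Wq, Wk_E, Wk_R, u, v):
    """Unnormalized relative attention score between query i and key j."""
    q = vecmat(x_i, Wq)               # x_i W^q
    k_content = vecmat(x_j, Wk_E)     # x_j W^k_E
    k_position = vecmat(r_ij, Wk_R)   # r_{i-j} W^k_R
    return (dot(q, k_content)         # content-based addressing
            + dot(q, k_position)      # content-dependent positional bias
            + dot(u, k_content)       # global content bias
            + dot(v, k_position))     # global positional bias

identity = [[1.0, 0.0], [0.0, 1.0]]
x_i, x_j = [1.0, 2.0], [3.0, 4.0]
r = [0.5, -0.5]                       # toy encoding for the offset i - j
u, v = [1.0, 0.0], [0.0, 1.0]
score = rel_attention_score(x_i, x_j, r, identity, identity, identity, u, v)
```

With these toy values the four terms are 11, -0.5, 3 and -0.5. Position enters only through $\mathbf{r}_{i-j}$, so the score depends on the offset between query and key, not on their absolute positions.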

Adaptive Attention Span

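One key advantage of the Transformer is the capability of capturing long-term dependencies. Depending on the context, the model may prefer to attend further back at some times than at others, or one attention head may have a different attention pattern from another. If the attention span could adapt its length flexibly and only attend further back when needed, it would help reduce both the computation and memory cost of supporting a larger maximum context size in the model.

This is the motivation for Adaptive Attention Span. Sukhbaatar, et al. (2019) proposed a self-attention mechanism that seeks an optimal attention span. They hypothesized that different attention heads might assign scores differently within the same context window (see Fig. 7), and thus the optimal span would be trained separately for each head.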

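Given the $i$-th token, we need to compute the attention weights between this token and other keys at positions $j \in S_i$, where $S_i$ defines the $i$-th token's context window.

$$
\begin{align}
e_{ij} &= \mathbf{q}_i {\mathbf{k}_j}^\top \\
a_{ij} &= \text{softmax}(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{r=i-s}^{i-1} \exp(e_{ir})} \\
\mathbf{y}_i &= \sum_{r=i-s}^{i-1} a_{ir} \mathbf{v}_r = \sum_{r=i-s}^{i-1} a_{ir} \mathbf{x}_r \mathbf{W}^v
\end{align}
$$

A soft mask function $m_z$ is added to control an effective adjustable attention span; it maps the distance between query and key to a value in [0, 1]. $m_z$ is parameterized by $z \in [0, s]$, and $z$ is to be learned:

$$
\begin{align}
m_z(x) = \text{clamp}\left(\frac{1}{R}(R + z - x), 0, 1\right)
\end{align}
$$

where $R$ is a hyperparameter which defines the softness of $m_z$.

The soft mask function is applied to the softmax elements in the attention weights:

$$
\begin{align}
a_{ij} = \frac{m_z(i-j) \exp(e_{ij})}{\sum_{r=i-s}^{i-1} m_z(i-r) \exp(e_{ir})}
\end{align}
$$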

In the above equation, $z$ is differentiable, so it is trained jointly with the other parts of the model. The parameters $z^{(i)}, i=1, \dots, h$ are learned separately for each head. Moreover, the loss function has an extra L1 penalty on $\sum_{i=1}^h z^{(i)}$.
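A small sketch of the masking for a single query follows, using the soft mask from Sukhbaatar et al., $m_z(x) = \text{clamp}((R + z - x)/R, 0, 1)$, where $x$ is the query-key distance. The function names and the toy scores are assumptions, and $z$ is a fixed number here instead of a learned parameter.

```python
import math

def soft_mask(x, z, R):
    """m_z(x) = clamp((R + z - x) / R, 0, 1) for a query-key distance x."""
    return min(max((R + z - x) / R, 0.0), 1.0)

def masked_attention(scores, z, R):
    """scores[k] is the raw score of the key at distance k + 1 behind the query.
    Assumes at least one key keeps a non-zero mask, so the sum is positive."""
    m = max(scores)  # subtract the max for numerical stability
    weighted = [soft_mask(k + 1, z, R) * math.exp(s - m) for k, s in enumerate(scores)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Six past keys with equal raw scores, span z = 2 and softness R = 2.
weights = masked_attention([0.0] * 6, z=2.0, R=2.0)
```

The mask equals 1 up to distance $z$, ramps down linearly over the next $R$ positions and is 0 beyond that, so here the weights are 0.4, 0.4, 0.2 and then zeros. Keys with zero mask never need to be computed, which is where the savings come from.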

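Using Adaptive Computation Time, the approach can be further enhanced to have a flexible attention span length that adapts dynamically to the current input. The span parameter $z_t$ of an attention head at time $t$ is a sigmoidal function, $z_t = S \sigma(\mathbf{v} \cdot \mathbf{x}_t + b)$, where the vector $\mathbf{v}$ and the bias scalar $b$ are learned jointly with the other parameters.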
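As a quick numeric illustration of the dynamic span, the sketch below evaluates $z_t = S \sigma(\mathbf{v} \cdot \mathbf{x}_t + b)$ for made-up inputs. Here `S` is taken to be the maximum span, and the values of `v`, `b` and `x_t` are arbitrary rather than learned.

```python
import math

def dynamic_span(x_t, v, b, S):
    """z_t = S * sigmoid(v . x_t + b), which always lies strictly between 0 and S."""
    logit = sum(vi * xi for vi, xi in zip(v, x_t)) + b
    return S / (1.0 + math.exp(-logit))

z_mid = dynamic_span([1.0, -1.0], [0.5, 0.5], 0.0, S=100)   # logit = 0
z_long = dynamic_span([2.0, 2.0], [0.5, 0.5], 1.0, S=100)   # logit = 3
z_short = dynamic_span([-2.0, -2.0], [0.5, 0.5], 1.0, S=100)  # logit = -1
```

A zero logit yields half of the maximum span, while inputs that push the logit up or down lengthen or shorten the span for that time step.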

In the experiments on Transformer with adaptive attention span, Sukhbaatar, et al. (2019) found a general tendency that lower layers do not require very long attention spans, while a few attention heads in higher layers may use exceptionally long spans. Adaptive attention span also helps greatly reduce the number of FLOPS, especially in a big model with many attention layers and a large context length.

Localized Attention Span (Image Transformer)
