Ryotta's Basic

LLM Basics

LLM Basics: Principles of Large Language Models

Tokenization · Transformer · Self-Attention · Training · Inference & KV Cache

Overview

An LLM (Large Language Model) is a model trained to take the tokens seen so far as input and predict the probability distribution of the next token. Pushing this simple objective with massive data and enough parameters makes a range of tasks, such as translation, summarization, coding, and question answering, emerge within a single model.

From a systems perspective, LLMs show clearer bottlenecks in inference than in training. Prefill is a compute-intensive phase dominated by large matrix multiplications, while decode is a memory-intensive phase that repeatedly reads the KV cache. Understanding LLM basics therefore means looking beyond model architecture to the KV cache, memory bandwidth, and serving strategy.

Figure 1. The autoregressive loop: input → tokenization → embedding → Transformer → probability distribution → sampling.

1. Key Concepts

Tokens and Tokenization

  • A token is the unit into which text is split. In practice, LLMs often use subwords, which are smaller than words. For example, running may be split into run and ning.
  • Tokenization typically uses BPE (Byte-Pair Encoding) or a related subword method (WordPiece, or the BPE and unigram models implemented in SentencePiece).
  • Each token is converted to an integer ID before entering the model, and the model's output is a next-token probability distribution over the entire vocabulary.
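The bullets above can be made concrete with a toy BPE trainer. This is a minimal sketch, not a production tokenizer: the corpus, the number of merges, and the helper names (learn_bpe, merge_pair, encode) are all assumptions made for illustration, and real tokenizers add byte-level fallback, pre-tokenization, and special tokens.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    vocab = Counter(tuple(w) for w in words)  # each word starts as single characters
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges

def encode(word, merges):
    """Apply the learned merges, in order, to split a word into subword tokens."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

# Hypothetical toy corpus; real training data is vastly larger.
corpus = ["run", "running", "runner", "ran", "winning"]
merges = learn_bpe(corpus, num_merges=10)
tokens = encode("running", merges)

# Map every subword seen in the corpus to an integer ID, as the model expects.
token_to_id = {tok: i for i, tok in enumerate(
    sorted({s for w in corpus for s in encode(w, merges)}))}
ids = [token_to_id[t] for t in tokens]
print(tokens, ids)
```

The exact split of running depends entirely on the corpus statistics, which is why the same word can tokenize differently across models.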

Transformer

Most LLMs today use the Transformer. The GPT family uses a decoder-only architecture built by stacking many layers of the same Transformer block. Input tokens are converted to vectors, passed through the blocks to produce context-aware representations, and finally mapped to next-token scores (logits).

  • Embedding: Converts a token ID into a vector of dimension d_model.
  • Positional information: A signal that tells the model the order of tokens. The original Transformer used sinusoidal positional encoding; modern LLMs often use RoPE or ALiBi.
  • Block: Consists of LayerNorm -> Self-Attention -> Residual, followed by LayerNorm -> FFN -> Residual.
  • FFN: A two-layer MLP that applies a per-token nonlinear transformation. Recent models often use gated variants such as SwiGLU.
  • LM Head: Maps the final representation to one score per vocabulary entry. It often shares weights with the input embedding.
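The block ordering listed above (LayerNorm -> Self-Attention -> Residual, then LayerNorm -> FFN -> Residual) can be sketched in plain Python to show the data flow. This is only an illustration under loud assumptions: layer_norm omits the learned gain and bias, and toy_attention and toy_ffn are placeholder sub-layers invented here, not real attention or a real MLP.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    """Element-wise sum, used for the residual connections."""
    return [x + y for x, y in zip(a, b)]

def pre_ln_block(tokens, attention, ffn):
    """tokens: list of vectors. attention sees the whole sequence; ffn sees one vector."""
    # Sub-layer 1: LayerNorm -> Self-Attention -> Residual
    attn_out = attention([layer_norm(t) for t in tokens])
    tokens = [add(t, a) for t, a in zip(tokens, attn_out)]
    # Sub-layer 2: LayerNorm -> FFN -> Residual
    ffn_out = [ffn(layer_norm(t)) for t in tokens]
    return [add(t, f) for t, f in zip(tokens, ffn_out)]

def toy_attention(seq):
    """Placeholder (assumption): causal running mean over positions 0..i."""
    out = []
    for i in range(len(seq)):
        prefix = seq[: i + 1]
        out.append([sum(col) / len(prefix) for col in zip(*prefix)])
    return out

def toy_ffn(v):
    """Placeholder (assumption): a bare ReLU instead of a two-layer MLP."""
    return [max(0.0, u) for u in v]

x = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]]
y = pre_ln_block(x, toy_attention, toy_ffn)
print(len(y), len(y[0]))  # shape is preserved, so blocks can be stacked
```

Because each sub-layer only adds to its input, a block whose sub-layers output zeros leaves the representation unchanged, which is part of why deep stacks of such blocks remain trainable.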

Autoregressive Generation

A single forward pass yields the distribution for one next token. The model selects a token from it, appends it to the sequence, and predicts again. This repeated loop is autoregressive generation.
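The loop described above can be sketched with a stand-in model. Everything model-related here is hypothetical: BIGRAM is a hand-written table, and next_token_probs only looks at the last token, whereas a real LLM runs a Transformer forward pass conditioned on the whole sequence.

```python
# Hypothetical next-token table standing in for a trained model.
BIGRAM = {
    "<bos>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "<eos>": 0.3},
    "dog": {"sat": 0.2, "<eos>": 0.8},
    "sat": {"<eos>": 1.0},
}

def next_token_probs(tokens):
    """Stand-in for one forward pass: returns a distribution over the next token."""
    return BIGRAM[tokens[-1]]

def generate(prompt, max_new_tokens=10):
    """Greedy autoregressive decoding: predict, pick, append, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        nxt = max(probs, key=probs.get)  # greedy choice; sampling would go here
        tokens.append(nxt)
        if nxt == "<eos>":
            break
    return tokens

print(generate(["<bos>"]))  # ['<bos>', 'the', 'cat', 'sat', '<eos>']
```

Each iteration depends on the token chosen in the previous one, so the steps of the loop cannot be run in parallel for a single sequence.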

Scaling and Inference Metrics

  • LLM performance tends to follow scaling laws: it generally improves as parameter count, training tokens, and training compute increase. Reaching the same performance at lower cost, however, also requires tuning data quality, optimizer settings, and parallelization strategy.
  • In inference systems, TTFT (Time To First Token) and per-token latency are the main metrics. Prefill optimizations reduce TTFT, while decode optimizations raise the number of tokens generated per second.
  • Recent models sometimes replace MHA (multi-head attention) with GQA (Grouped Query Attention) or MQA (Multi-Query Attention) to reduce KV cache size and memory bandwidth demand.
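As a rough illustration of the inference metrics above, the following sketch derives TTFT and per-token latency from token arrival timestamps. The function name, the dictionary keys, and the timestamps are assumptions made for this example; per-token latency is computed here as the average gap between output tokens after the first, one common convention sometimes called TPOT (time per output token).

```python
def latency_metrics(request_time, token_times):
    """token_times: wall-clock times (seconds) at which each output token arrived."""
    if not token_times:
        raise ValueError("need at least one output token")
    # TTFT covers queueing plus prefill plus the first decode step.
    ttft = token_times[0] - request_time
    n = len(token_times)
    # Per-token latency: mean gap between consecutive tokens after the first.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    tokens_per_s = 1.0 / tpot if tpot > 0 else float("inf")
    return {"ttft_s": ttft, "tpot_s": tpot, "decode_tokens_per_s": tokens_per_s}

# Hypothetical timestamps: request at t=0.0, first token at 0.5 s, then one every 0.25 s.
m = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
print(m)  # ttft_s=0.5, tpot_s=0.25, decode_tokens_per_s=4.0
```

Separating the two numbers matters because they respond to different optimizations: a faster prefill lowers ttft_s without changing tpot_s, and a faster decode does the reverse.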

2. Comparison and Analysis

Aspect | Prefill | Decode
Input | Entire prompt | One new token
Compute pattern | Dominated by large matrix multiplications | Many small, repeated operations
Bottleneck | compute-bound | memory-bound
KV cache | Computed once and stored | Stored K/V reused, new token's K/V appended
Perceived impact | Directly affects TTFT | Directly affects generation speed

Sampling | Behavior | Characteristics
Greedy | Selects the highest-probability token | Deterministic but can be repetitive
Temperature | Adjusts the sharpness of the distribution | Higher values increase diversity
Top-k | Samples only from the top k candidates | Reduces unlikely tokens
Top-p (nucleus) | Uses candidates up to cumulative probability p | Number of candidates varies with context
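The four sampling strategies in the table can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not any library's API: the function names and the example logits are assumptions, and production implementations operate on batched tensors.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def renormalize(probs, keep):
    """Zero out everything outside `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return renormalize(probs, keep)

def sample(probs, rng=random):
    """Draw one token index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores for a 4-token vocabulary
probs = softmax(logits, temperature=0.8)
choice = sample(top_p_filter(top_k_filter(probs, k=3), p=0.9), random.Random(0))
print(greedy(logits), choice)
```

Top-k always keeps a fixed number of candidates, while top-p keeps few candidates when the distribution is peaked and many when it is flat, which is the context dependence noted in the table.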

3. How It Works

Self-Attention

Each token produces three vectors, Query (Q), Key (K), and Value (V). The dot product of Q and K gives the similarity between tokens, and the softmax of those scores weights a sum over V. In a decoder-only LLM, a causal mask blocks attention to future tokens.

Attention(Q, K, V) = softmax((QK^T) / sqrt(d_k) + mask) V
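import torch
# Q, K, V: tensors of shape (..., seq_len, d_k); causal_mask is True where attention is blocked (key position > query position)
scores = (Q @ K.transpose(-2, -1)) / (d_k ** 0.5)
scores = scores.masked_fill(causal_mask, float('-inf'))
weights = torch.softmax(scores, dim=-1)
output = weights @ V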
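For readers without a tensor library at hand, the same computation can be written for a single head with plain lists. This is a dependency-free sketch under stated assumptions: causal_attention and the tiny Q, K, V values are invented for illustration, and instead of filling masked scores with -inf it simply skips future positions, which yields the same weights.

```python
import math

def causal_attention(Q, K, V):
    """Single-head causal attention. Q, K, V are lists of n vectors (lists of floats)."""
    d_k = len(K[0])
    outputs = []
    for i, q in enumerate(Q):
        # Position i may only attend to positions 0..i (the causal mask).
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        m = max(scores)  # stabilize the softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of the visible value vectors.
        outputs.append([sum(w * V[j][c] for j, w in enumerate(weights))
                        for c in range(len(V[0]))])
    return outputs

# Hypothetical 3-token sequence with d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = causal_attention(Q, K, V)
print(out[0])  # the first token can only see itself, so this equals V[0]
```

A useful check of causality is that changing the last token's K or V leaves the outputs of earlier positions untouched.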

Training Stages

  • Pretraining: Learns next-token prediction from massive amounts of unlabeled text.
  • SFT (supervised fine-tuning): Learns dialogue and instruction following from human-written instruction-response pairs.
  • Alignment: Uses methods such as RLHF or DPO to tune the model toward helpfulness, safety, and honesty.
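The pretraining objective in the first bullet is usually expressed as the average cross-entropy of the true next token. The sketch below assumes a hypothetical predict function that returns a probability list over the vocabulary for a given prefix; uniform_predict and the token IDs are made up for illustration.

```python
import math

def next_token_loss(token_ids, predict):
    """Average cross-entropy of predicting token t+1 from the prefix token_ids[:t+1]."""
    losses = []
    for t in range(len(token_ids) - 1):
        probs = predict(token_ids[: t + 1])  # distribution over the vocabulary
        target = token_ids[t + 1]            # the token that actually came next
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

VOCAB_SIZE = 4  # hypothetical tiny vocabulary

def uniform_predict(prefix):
    """Stand-in for an untrained model: every token is equally likely."""
    return [1.0 / VOCAB_SIZE] * VOCAB_SIZE

loss = next_token_loss([0, 2, 1, 3], uniform_predict)
print(loss, math.log(VOCAB_SIZE))  # an uninformed model scores log(vocab size)
```

Training lowers this number by shifting probability mass toward the tokens that actually follow each prefix; a loss of log(vocabulary size) is the baseline of knowing nothing.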

KV Cache

Autoregressive generation attends to the entire past at every step, so without storing past K/V the recomputation cost grows quickly. The KV cache computes each token's K/V only once, stores it, and reuses it to speed up decode.
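The saving can be seen by counting K/V projections with and without a cache. This is a toy sketch: project stands in for the learned linear layers by merely scaling the input (an assumption made so the example stays dependency-free), and the KVCache class and its counter are invented here for illustration.

```python
import math

def project(x, scale):
    """Stand-in for a learned projection (assumption: just scales the vector)."""
    return [scale * v for v in x]

def attend(q, keys, values):
    """Softmax attention of one query over the given keys and values."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e / z * v[c] for e, v in zip(exps, values))
            for c in range(len(values[0]))]

class KVCache:
    """Stores each token's K and V once and reuses them on later decode steps."""
    def __init__(self):
        self.keys, self.values = [], []
        self.kv_projections = 0

    def step(self, x):
        # Only the new token is projected; past K/V come from the cache.
        self.keys.append(project(x, 0.5))
        self.values.append(project(x, 2.0))
        self.kv_projections += 1
        return attend(project(x, 1.0), self.keys, self.values)

def step_without_cache(xs):
    """Recompute K/V for the whole prefix; returns (output, projections done)."""
    keys = [project(x, 0.5) for x in xs]
    values = [project(x, 2.0) for x in xs]
    return attend(project(xs[-1], 1.0), keys, values), len(xs)

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]  # hypothetical embeddings
cache, recomputed = KVCache(), 0
for t in range(len(xs)):
    cached_out = cache.step(xs[t])
    plain_out, n = step_without_cache(xs[: t + 1])
    recomputed += n
    assert all(abs(a - b) < 1e-12 for a, b in zip(cached_out, plain_out))
print(cache.kv_projections, recomputed)  # 4 versus 10 for a 4-token sequence
```

The outputs are identical either way; the cache trades memory for the projections it avoids, which is why its size becomes the concern at long sequence lengths.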

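The KV cache grows in proportion to sequence length and batch size. As a rough estimate, KV_bytes ≈ 2 × L × d_model × seq_len × batch × dtype_bytes, where the factor 2 covers K and V and L is the number of layers. This form assumes standard multi-head attention; with GQA or MQA, d_model is replaced by the smaller n_kv_heads × head_dim. For example, with L=80, d_model=8192, seq_len=8192, batch=1, and FP16 (2 bytes), a single sequence alone takes about 21 GB.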
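The estimate above is easy to check numerically. In this sketch, d_model is written as n_kv_heads × head_dim so that the same function also covers GQA; the split of 8192 into 64 heads of dimension 128 and the 8-KV-head GQA configuration are assumptions chosen for illustration, not figures from a specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes):
    """Bytes needed for the K and V tensors across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

GIB = 1024 ** 3

# MHA: every head keeps its own K/V, so n_kv_heads * head_dim == d_model == 8192.
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128,
                     seq_len=8192, batch=1, dtype_bytes=2)
print(mha, mha / GIB)  # 21474836480 bytes = 20.0 GiB (about 21.5 GB)

# GQA with 8 KV heads (hypothetical): K/V are shared across groups of query heads.
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                     seq_len=8192, batch=1, dtype_bytes=2)
print(gqa / GIB)  # 2.5 GiB, an 8x reduction
```

Since every factor enters linearly, doubling the batch or the context length doubles the cache, while cutting KV heads or the element size shrinks it by the same ratio.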

4. Pros and Cons

Pros | Cons
Highly general-purpose | Parameter and data scale are very large
Parallelizes well | Attention cost grows as O(n²) with sequence length
Applies to many tasks by changing only the prompt | KV cache becomes a memory bottleneck at inference
Much room for serving optimization | Cost rises quickly with long contexts and large batches

5. Related Technologies

Key Sources

Source | Key point
Vaswani et al., 2017, Attention Is All You Need | Origin of the Transformer
Brown et al., 2020, Language Models are Few-Shot Learners | Scaling up large autoregressive language models
Dao et al., 2022, FlashAttention | IO-aware exact attention
Kwon et al., 2023, PagedAttention | Manages the KV cache in pages
Kaplan et al., 2020, Scaling Laws for Neural Language Models | Scaling relationships between model size, data, compute, and performance

6. Key Takeaways

An LLM is a Transformer-based autoregressive model that predicts the probability distribution of the next token. Tokenization, embedding, self-attention, FFN, and the LM head form the basic flow.

Training proceeds through pretraining, SFT, and alignment, and performance generally improves along a power law as scale grows. In inference, however, prefill and decode have different bottlenecks, and decode in particular makes memory bandwidth critical because of the KV cache.

That is why techniques such as FlashAttention, PagedAttention, KV cache quantization, GQA/MQA, and offloading are central to LLM systems. The point of this document is to look beyond model architecture and see why memory becomes the bottleneck during inference.