Ryotta's Basic

StreamingLLM Analysis

StreamingLLM / Attention Sinks: In-Depth Analysis

Infinite streaming · Attention Sink · Sink + Window · Rolling Cache · KV eviction (H2O) · "Not context extension"

StreamingLLM is a technique that lets an LLM trained with a finite attention window handle inputs of unbounded length stably, without fine-tuning. Its core is the discovery of the "attention sink" phenomenon: the initial tokens receive a large share of attention regardless of their meaning, because softmax forces the weights to sum to 1 and the surplus attention is dumped on the initial tokens, which are always visible. StreamingLLM stabilizes attention by keeping a few initial tokens (the sinks) together with a recent sliding window, and it caps the KV cache at a fixed size regardless of sequence length. The important caveat is that this enables infinite "streaming", not an "extension" of the context length. This document proceeds in the order: problem → discovery of the attention sink → solution → results and comparison with KV eviction → a key misconception and where the method sits among KV-reduction techniques.

The original paper reports stable processing of 4M tokens and more on Llama-2, MPT, Falcon, and Pythia, with up to a 22.2x speedup over sliding window with recomputation. It also proposes a variant that adds a placeholder token as a dedicated sink during pre-training, and shows that this further improves streaming deployment.

Core concepts

The essence of StreamingLLM is not to evict ever more tokens indiscriminately, but to keep the first few tokens as sinks that act as a buffer for the attention distribution. Unlike a sliding window that retains only recent tokens, this approach pins the initial tokens in place so that the surplus attention produced by softmax has somewhere to go.

The scheme can be applied to an existing full-attention model without retraining, and it keeps the KV cache size constant over long streaming inputs. It does not, however, increase the model's ability to remember anything outside the window, so tasks that need long-range reasoning should be considered together with context extension or retrieval techniques.

Comparison / analysis

Technique | KV kept | Mechanism | Strengths | Limitations
StreamingLLM | Initial sinks + recent window | KV eviction policy at inference time | Training-free, bounded memory | Cannot recall the distant past
H2O | Heavy hitters + recent tokens | Eviction based on accumulated attention scores | Dynamically preserves important tokens | Requires computing the scores
Sliding window with recomputation | Recent window only, recomputed every step | Exact window attention | High quality and stability | High cost, O(TL²) for window length L

How it works

  1. Pin the first few tokens as sinks.
  2. When a new token arrives, update the recent window and drop its oldest token.
  3. Re-number positions relative to the cache so that they never go beyond the training length.
  4. The sinks act as a drain for attention and prevent the distribution from collapsing.
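
The steps above can be sketched as a small function that returns which original token indices are still cached after a given number of tokens. This is a minimal illustration, not the paper's implementation; the function name and the defaults (4 sinks, a window of 1020) are assumptions chosen to match the 4+1020 configuration quoted in this article.

```python
def kept_indices(num_seen: int, num_sinks: int = 4, window: int = 1020) -> list[int]:
    """Original indices of the tokens still cached after `num_seen` tokens.

    Hypothetical helper: the first `num_sinks` tokens are pinned, and the
    recent window holds at most `window` of the latest tokens.
    """
    sinks = list(range(min(num_sinks, num_seen)))
    # The recent window never overlaps the sinks.
    window_start = max(num_sinks, num_seen - window)
    recent = list(range(window_start, num_seen))
    return sinks + recent


if __name__ == "__main__":
    # Tiny configuration so the eviction is visible: tokens 4 and 5 are gone.
    print(kept_indices(num_seen=9, num_sinks=4, window=3))  # [0, 1, 2, 3, 6, 7, 8]
    # The cache size stays bounded however long the stream gets.
    print(len(kept_indices(num_seen=1_000_000)))            # 1024
```

Tokens between the sinks and the start of the recent window are dropped permanently, which is why the cache size does not depend on the stream length.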

Pros and cons

  • Strengths: memory use is constant, the method applies without fine-tuning, and it reduces latency and the risk of OOM in long streams.
  • Limitations: information outside the window is lost, so the method is a poor fit for tasks where recalling the distant past matters, such as question answering over long documents.

Related work

  • StreamingLLM: Efficient Streaming Language Models with Attention Sinks (ICLR 2024, arXiv 2309.17453)
  • H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (arXiv 2306.14048)
  • Sliding window attention in Mistral 7B (arXiv 2310.06825)
  • Sliding Window Attention Analysis (llm_0060_sliding_window_attention_analysis.html)
  • KV Cache Offloading Analysis (llm_0040_kv_cache_offloading_analysis.html)
  • Memory Centric LLM Serving Survey (llm_0003_memory_centric_llm_serving_survey.html)

Key summary

StreamingLLM is a streaming technique that combines attention sinks with a recent window to keep an existing model's KV cache at a fixed size. It builds on the observation that the initial tokens act as a buffer for attention, and it mitigates the collapse that window attention suffers when those tokens are lost. It is not the same as context extension, which lets a model remember the distant past directly, so tasks that need long-range understanding should pair it with a context extension or retrieval technique.

1. The problem: two naive approaches fail under infinite streaming

Streaming applications in which input keeps arriving without end, such as multi-turn dialogue, create two burdens. The KV cache of past tokens eats up memory, and the model does not generalize to inputs longer than its training sequence length. The two simplest approaches both fail.


Figure 1. Why dense attention and window attention each fail, and the decisive evidence

Why the two approaches fail

  • Dense attention: keeps the KV of every past token. Attention compute grows as O(T²) and the cache grows with every token, and perplexity blows up once the input exceeds the training length.

  • Window attention: caches only the most recent KV, which is efficient. But perplexity blows up the moment the initial tokens are pushed out of the cache.

  • Decisive evidence: for Llama-2-13B measured on PG19, the window configuration (0+1024) collapses to a PPL of 5158, whereas restoring the first 4 tokens (4+1020) brings PPL back to 5.40. The collapse is caused by losing the initial tokens, not by anything about the recent ones.

The exact alternative, sliding window with recomputation, rebuilds the window's KV at every step. It is accurate but very slow, costing O(TL²) for a window of length L. So why do the initial tokens matter so much? The answer is the attention sink.
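
To make the cost gap concrete, the sketch below counts query-key score evaluations for three strategies under a deliberately simple model. The function names and the model itself are assumptions for illustration: it counts only attention scores, uses a steady-state estimate for recomputation (ignoring the warm-up before the window fills), and says nothing about wall-clock latency.

```python
def dense_cost(T: int) -> int:
    """Dense attention with a full KV cache: token t attends to t keys."""
    return T * (T + 1) // 2


def recompute_cost(T: int, L: int) -> int:
    """Sliding window with recomputation: rebuild an L-token causal window per step."""
    per_step = L * (L + 1) // 2  # causal attention inside one window
    return T * per_step


def streaming_cost(T: int, L: int) -> int:
    """Rolling cache of at most L entries: one query against at most L keys per step."""
    if T <= L:
        return T * (T + 1) // 2
    return L * (L + 1) // 2 + (T - L) * L


if __name__ == "__main__":
    T, L = 4_000_000, 1024  # stream length and cache size, illustrative values
    print("dense     :", dense_cost(T))
    print("recompute :", recompute_cost(T, L))
    print("streaming :", streaming_cost(T, L))
```

The ratio between the recomputation and rolling-cache counts is far larger than the measured speedup quoted in this article, because parallel hardware batches the per-step recomputation efficiently; treat the counts as an asymptotic illustration only.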

2. The key finding: the attention sink phenomenon

StreamingLLM starts from a single observation. The initial tokens receive disproportionately large attention scores regardless of their semantic importance, and this is what the paper calls an attention sink.


Figure 2. Attention concentrating on the initial tokens, and its cause (softmax sums to 1)

Why the initial tokens become sinks

  • The softmax constraint: attention is computed with softmax, so all the weights sum to 1. Any surplus attention therefore has to be assigned somewhere.

  • The drain role: when nothing in the past matches the current token strongly, the leftover attention needs a place to go. Because of the autoregressive structure, almost every later token can see the initial tokens (a stable, always-visible position), so the initial tokens become the drain for surplus attention.

  • The mechanism of collapse: when window attention discards the initial tokens, the softmax denominator, and with it the attention distribution, shifts abruptly and the model breaks down. Similar observations have been reported for quantization outliers and for SoftMax-Off-by-One (softmax1); all of them share the same insight of giving the surplus somewhere to go.
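
A toy calculation shows what happens to the remaining weights when the sink leaves the softmax. The logits below are hypothetical values invented for illustration (position 0 plays the sink, the others match the query only weakly); they are not measurements from any model.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of attention logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Hypothetical logits: index 0 is the sink, the rest are weak matches.
    logits = [4.0, 0.1, 0.0, 0.2, 0.1]
    with_sink = softmax(logits)
    without_sink = softmax(logits[1:])  # the sink has been evicted from the cache

    print(f"weight on the sink: {with_sink[0]:.2f}")
    print("others, sink kept   :", [round(w, 3) for w in with_sink[1:]])
    print("others, sink evicted:", [round(w, 3) for w in without_sink])
```

With the sink present, the weak matches receive small weights. Once it is evicted, the same tokens must absorb the entire unit of attention, so every one of their weights jumps, which is a distribution the model did not see during training.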

Decisive experiment: replacing the first 4 tokens with meaningless newline tokens ('\n') still restores perplexity (5.60). What matters is not the content of the initial tokens but their position, that is, the fact that a token is there at all. A sink is a structural anchor, not a carrier of meaning.

3. The solution: StreamingLLM (sinks + recent window)

Turning the finding into a solution is simple. Always keep a few initial tokens (the sinks) and add a recent sliding window on top of them.


Figure 3. The sink + recent window structure, the rolling cache, and positional encoding relative to the cache

Key points

  • Sink + window: the initial sinks keep draining surplus attention, so the attention distribution stays stable and the collapse disappears. No fine-tuning is needed; an already trained model can use the method immediately by changing only how the cache is managed.

  • Rolling cache: when a new token arrives, the left end of the window is evicted while the sinks are always preserved, so the KV size stays constant (bounded) regardless of sequence length.

  • Positional encoding: positions are assigned relative to the current cache, not taken from the absolute position in the original text. This keeps token indices within the training length, so the model behaves stably.
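
The positional-encoding point can be illustrated by comparing the query-to-key distances a relative position scheme would see under the two numbering choices. The helper names are hypothetical, and the cached indices are a small hand-picked example (4 sinks, 3 recent tokens, two tokens already evicted).

```python
def cache_positions(kept: list[int]) -> list[int]:
    """Positions assigned inside the cache: 0, 1, 2, ... regardless of origin."""
    return list(range(len(kept)))


def query_key_distances(key_positions: list[int], query_position: int) -> list[int]:
    """Distance from the query to each key, as a relative position scheme sees it."""
    return [query_position - p for p in key_positions]


if __name__ == "__main__":
    kept = [0, 1, 2, 3, 6, 7, 8]  # original indices of the cached tokens

    # Original-text positions: the query is token 9, and the distance to the
    # sinks keeps growing as the stream gets longer.
    print(query_key_distances(kept, query_position=9))
    # -> [9, 8, 7, 6, 3, 2, 1]

    # Cache positions: the query sits right after the cache, so distances are
    # bounded by the cache size.
    print(query_key_distances(cache_positions(kept), query_position=len(kept)))
    # -> [7, 6, 5, 4, 3, 2, 1]
```

Under cache-relative numbering, the largest distance is the cache size, so it stays within the range seen during training no matter how many tokens have streamed past.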

4. Results and comparison with KV eviction


Figure 4. Results, the dedicated sink token, and comparison across KV eviction methods

Results

  • Llama-2, MPT, Falcon, and Pythia handle 4M tokens and more stably without fine-tuning, with perplexity that stays stable at the level of the oracle (sliding window with recomputation).

  • It is up to 22.2x faster than sliding window with recomputation, which recomputes at every step, because the cache size is constant.

  • Dedicated sink token: if a single learnable placeholder token is added during pre-training, one sink is enough (Zero Sink / Sink Token ablation). This insight has since influenced the design of some models that include a learnable sink.
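
The Zero Sink variant mentioned in the ablation corresponds to SoftMax-Off-by-One, which adds 1 to the softmax denominator so that the weights no longer have to sum to 1. The sketch below compares it with standard softmax on hypothetical logits; it illustrates the formula only and is not code from the paper.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Standard softmax: the weights always sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_off_by_one(logits: list[float]) -> list[float]:
    """SoftMax1: exp(x_i) / (1 + sum_j exp(x_j)), so the weights may sum to less than 1."""
    m = max(max(logits), 0.0)  # stabilize, treating the implicit extra logit as 0
    exps = [math.exp(x - m) for x in logits]
    denom = math.exp(-m) + sum(exps)
    return [e / denom for e in exps]


if __name__ == "__main__":
    weak_matches = [-3.0, -3.0, -3.0]  # hypothetical: nothing matches the query well
    print("softmax  sum:", round(sum(softmax(weak_matches)), 3))
    print("softmax1 sum:", round(sum(softmax_off_by_one(weak_matches)), 3))
```

The extra 1 in the denominator behaves like an always-present key with logit 0 and a zero value, which gives the surplus a place to go without spending attention on a real token.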

Comparison across KV eviction methods

StreamingLLM is position-based (initial sinks + recent tokens), so its strengths are low cost and no training. H2O dynamically selects the important tokens (heavy hitters) to keep using accumulated attention scores, but it has to compute those scores. Both methods permanently evict whatever falls outside the window or the selection, so they share the "lost-in-the-middle" limitation: information from the middle of the stream disappears.
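
The difference between the two selection rules can be sketched as follows. Both functions are simplified illustrations with hypothetical names: the score-based one takes a single list of accumulated attention scores as given, and does not reproduce H2O's actual per-head bookkeeping or its incremental updates.

```python
def position_based_keep(num_tokens: int, num_sinks: int, num_recent: int) -> list[int]:
    """StreamingLLM-style rule: keep the first tokens and the most recent ones."""
    sinks = list(range(min(num_sinks, num_tokens)))
    start = max(num_sinks, num_tokens - num_recent)
    return sinks + list(range(start, num_tokens))


def score_based_keep(cum_scores: list[float], num_heavy: int, num_recent: int) -> list[int]:
    """H2O-style rule (simplified): keep recent tokens plus the highest-scoring older ones."""
    n = len(cum_scores)
    recent = set(range(max(0, n - num_recent), n))
    older = [i for i in range(n) if i not in recent]
    heavy = sorted(older, key=lambda i: cum_scores[i], reverse=True)[:num_heavy]
    return sorted(set(heavy) | recent)


if __name__ == "__main__":
    # Hypothetical accumulated attention scores for 8 cached tokens.
    scores = [5.0, 0.2, 0.1, 3.0, 0.3, 0.2, 0.4, 0.1]
    print(position_based_keep(len(scores), num_sinks=2, num_recent=3))  # [0, 1, 5, 6, 7]
    print(score_based_keep(scores, num_heavy=2, num_recent=3))          # [0, 3, 5, 6, 7]
```

In this example the score-based rule keeps token 3 because it has attracted attention, while the position-based rule keeps token 1 purely because of where it sits. Either way, every token outside the returned list is gone for good.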

5. A key misconception, and where StreamingLLM sits among KV-reduction techniques


Figure 5. The "not context extension" caveat, and the position among the four axes of KV reduction

A common misconception: this is not "context length extension"

What StreamingLLM enables is infinite "streaming" (generating without end while staying fluent and stable), not an "extension" of the context window. Tokens pushed out of the window are forgotten, so the model cannot recall the distant past. It does maintain accuracy on StreamEval up to 120K tokens, but note that those questions ask about recent information, with the answer located roughly 20 lines back. The method is therefore unsuitable for long-document understanding tasks that require remembering the distant past; that is the job of context extension techniques (LongChat, Llama-2-32K, and so on). The two are in fact complementary: context extension combines with StreamingLLM by enlarging its maximum cache size.
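
The caveat reduces to a simple membership check: a token can be attended to only if it is a sink or still inside the recent window. The helper below is a hypothetical sketch with assumed defaults (4 sinks, a window of 1020), and the token distances in the example are invented for illustration.

```python
def still_cached(token_index: int, current_length: int,
                 num_sinks: int = 4, window: int = 1020) -> bool:
    """True if the token at `token_index` is still cached once `current_length` tokens were seen.

    Assumes 0 <= token_index < current_length.
    """
    if token_index < num_sinks:
        return True  # sinks are never evicted
    return token_index >= current_length - window


if __name__ == "__main__":
    length = 120_000
    # An answer 500 tokens back is still inside the recent window.
    print(still_cached(length - 500, length))  # True
    # An answer from early in the stream was evicted long ago.
    print(still_cached(10_000, length))        # False
```

Enlarging `window`, which is what pairing the method with a context-extended model allows, is the only way this check starts returning True for older tokens.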

Position among the four axes of KV reduction

Key takeaways: StreamingLLM is a training-free technique that handles infinite streaming stably in constant memory by using initial sinks plus a recent window. Its central finding, the attention sink, is the phenomenon in which the softmax sum-to-1 constraint turns the initial tokens into a drain for surplus attention, and it explains why window attention collapses once those tokens are discarded. The solution is to preserve the sinks, roll the recent window, and assign positions relative to the cache; as a result it processes 4M tokens and more up to 22.2x faster than sliding window with recomputation. The most important caveat is that this enables "streaming", not context "extension": everything outside the window is forgotten, so the method is complementary to context extension. Within KV reduction, StreamingLLM works on the token axis and is orthogonal to quantization (bits), GQA/MLA (heads/compression), and PagedAttention (waste).

Memory-systems view (research extension): the biggest systems-level implication of StreamingLLM is that the KV cache size is constant (bounded) regardless of sequence length, which makes memory use predictable and avoids OOM. Because middle tokens are evicted permanently (lost-in-the-middle), workloads that need to recall the distant past must combine it with KV offloading (demoting entries to a lower tier instead of evicting them) or with retrieval (RAG). The trade-off between eviction and offloading, in other words "discard what can be discarded, but where do we keep what may be needed later", becomes a natural design axis.
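
A back-of-the-envelope calculation shows what a bounded cache means in bytes and how the token axis multiplies with the other axes. The default configuration below (32 layers, 32 KV heads, head dimension 128, 2 bytes per value) is an assumed 7B-class setup chosen for illustration, not a figure taken from this article or the paper.

```python
def kv_cache_bytes(tokens: int, layers: int = 32, kv_heads: int = 32,
                   head_dim: int = 128, bytes_per_value: float = 2.0) -> float:
    """KV cache size: keys and values for every layer, KV head, and cached token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


if __name__ == "__main__":
    MiB = 1024 ** 2
    # Token axis: an unbounded cache after 4M tokens versus a cache capped at 1024.
    print(f"4M tokens, FP16  : {kv_cache_bytes(4_000_000) / MiB:,.0f} MiB")
    print(f"1024 tokens, FP16: {kv_cache_bytes(1024) / MiB:,.0f} MiB")
    # The other axes multiply independently of the token cap:
    # fewer KV heads (GQA-style) and fewer bits per value (quantization).
    smaller = kv_cache_bytes(1024, kv_heads=8, bytes_per_value=0.5)
    print(f"1024 tokens, 8 KV heads, 4-bit: {smaller / MiB:,.0f} MiB")
```

Because the size is a plain product of the factors, capping the token count composes with head reduction and quantization rather than competing with them.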

Note: the figures in this article are values reported in the original paper (StreamingLLM, Xiao et al., ICLR 2024, arXiv 2309.17453) and in related work (H2O, arXiv 2306.14048, among others). "PPL 5158 vs 5.40", "4M tokens", "22.2x speedup", "4 sinks", and "StreamEval 120K" are results for specific models, datasets, and conditions, so generalize with care. Attention sinks are being actively used and extended in follow-up research and frameworks, and the best number of sink tokens and the best window size depend on the model and the workload.

from collections import deque

NUM_SINKS, W = 4, 1020                 # e.g. 4 sink tokens + 1020 recent tokens
sink_tokens = []                       # first NUM_SINKS tokens, never evicted
recent_window = deque(maxlen=W)        # rolling window of recent tokens

# On a new token: fill the sinks first, then roll the window.
def on_new_token(tok):
    if len(sink_tokens) < NUM_SINKS:
        sink_tokens.append(tok)
    else:
        recent_window.append(tok)      # a full deque drops its oldest (leftmost) entry for good

# Cache = initial sink tokens + recent sliding window, so the KV size stays bounded.
def cache():
    return sink_tokens + list(recent_window)

# Positional encoding: positions index the cache, not the original text,
# so they never exceed the training length.
def positions():
    return range(len(cache()))

Technique | What is kept | Characteristics
StreamingLLM | Initial sinks + recent window | Position-based, low cost, training-free
H2O (Heavy-Hitter) | Heavy hitters + recent tokens | Based on accumulated attention scores (dynamic selection)
Sliding window | Recent tokens only | Simple, but collapses once the initial tokens are lost

Axis | Representative techniques | What it reduces
Token eviction | StreamingLLM, H2O | Number of tokens kept (holds the cache size constant)
Bits (quantization) | KIVI, KVQuant | Precision per token (FP16 → INT2/4)
Heads / compression | GQA, MQA, MLA | Number of KV heads, or low-dimensional compression
Waste removal | PagedAttention | Waste from fragmentation and over-reservation