Ryotta's Basic

Sliding Window Attention Analysis

An In-Depth Analysis of Sliding Window Attention

Local attention architecture · Receptive field · Rolling buffer cache · Longformer, Mistral, Gemma · SWA vs StreamingLLM

Sliding Window Attention (SWA) is an architecture that localizes attention so that each token sees only the most recent W tokens rather than the entire past. It lowers attention cost from O(n²) to O(n×W), which makes long sequences tractable, and it caps the KV cache at W entries. The key insight is that stacking local-attention layers makes the receptive field grow in proportion to depth, much as it does in a CNN.

As a result, information can propagate much farther than a single window. What matters is that SWA is an architecture built into the model at training time, which makes it fundamentally different from StreamingLLM, an inference-time KV management technique. In practice, long-range access can be reinforced by mixing global tokens into the local window, as Longformer does, or by adding random patterns, as BigBird does. Mistral 7B, by contrast, applies SWA in every layer, and Gemma 2 interleaves local and global attention layers to get both efficiency and long-range recall. This article proceeds in the following order: core idea → receptive field → KV savings → variants and models (Longformer, Mistral, Gemma) → SWA vs StreamingLLM and the trade-offs.

1. Core Idea: Full vs Sliding Window Attention

In standard (full causal) attention, each token attends to every past token. For a sequence of length n this costs O(n²) compute and memory, which becomes expensive on long inputs. Sliding Window Attention localizes attention so that each token attends only to the most recent W tokens.

Figure 1. Full attention (triangular mask, O(n²)) versus Sliding Window Attention (band mask, O(n×W))

What Is Sliding Window Attention?

  • Local attention: each token attends only to the most recent W tokens. The attention mask becomes a band of width W along the diagonal.

  • Compute: O(n×W), which is linear in sequence length once W is fixed (an advantage over the O(n²) of full attention on long inputs).

  • An architectural choice: with SWA the model is designed and trained with this pattern from the start. It is not a trick that truncates the KV cache at inference time, and this is what sets it apart from StreamingLLM.

  • Origins: Sparse Transformers (Child et al., 2019) and Longformer (Beltagy et al., 2020) introduced the pattern, and Mistral 7B brought it to LLMs at scale. The natural objection, "if a token only looks locally, can it not see distant tokens at all?", is answered in the next section on the receptive field.
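The band-shaped mask and the O(n×W) cost described in the list above can be illustrated with a small sketch. It uses plain Python lists rather than a tensor library, and it assumes that the window counts the current token itself (each row has at most W allowed keys); other sources draw the window edge slightly differently. The function names are illustrative and do not come from any particular library.

```python
def causal_mask(n):
    """Full causal attention: token i may attend to every j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def sliding_window_mask(n, window):
    """Causal band mask: token i attends to j with i - window < j <= i.

    Assumption: the window includes the current token, so each row
    has at most `window` allowed keys.
    """
    return [[(j <= i) and (i - j < window) for j in range(n)] for i in range(n)]


def count_pairs(mask):
    """Number of (query, key) pairs that actually get scored."""
    return sum(sum(row) for row in mask)


n, W = 8, 3
full = causal_mask(n)
band = sliding_window_mask(n, W)

# Print the band: 'x' marks an allowed (query, key) pair.
for row in band:
    print("".join("x" if allowed else "." for allowed in row))

full_pairs = count_pairs(full)  # n * (n + 1) / 2 = 36
band_pairs = count_pairs(band)  # n * W - W * (W - 1) / 2 = 21
print(full_pairs, band_pairs)
```

For n much larger than W the band count approaches n×W, while the full causal count keeps growing quadratically.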

2. Key Insight: The Receptive Field Grows with Depth

SWA has to be understood across the whole stack of layers, not within a single layer. Within one layer a token directly sees only W tokens, but as activations pass through successive layers, information propagates beyond the window boundary.

Figure 2. How stacking layers grows the receptive field to W×k (analogous to a CNN)

Why Local Attention Still Sees Far

  • Recursive propagation: token i in layer k attends to positions [i-W, i] of the previous layer, and those tokens in turn attend to the most recent W positions of the layer before that. Applied recursively, token i indirectly reaches input tokens up to about W×k positions back.

  • Same principle as a CNN: this is the same mechanism as stacking several layers of small kernels (windows) to build a wide receptive field.

  • Mistral 7B example: with W=4096 and 32 layers, the theoretical attention span at the final layer is about 131K tokens (4096×32). The point is that tokens outside the window still influence next-token prediction.

A compromise between efficiency and expressiveness: compute stays cheap at O(n×W) while depth lets information flow reach far. This is an indirect path, however. Because it is not direct attention, precise recall over long distances is weaker (see the trade-offs later in this article).
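The recursive propagation argument can be checked with a tiny reachability simulation. The sketch below assumes that each layer lets token i read positions i-W+1 through i of the layer below (the window includes the token itself), so the exact span after k layers is k×(W-1)+1, which the article rounds to W×k. The helper name `reachable_inputs` is made up for illustration.

```python
def reachable_inputs(window, num_layers, query_pos):
    """Input positions that can influence `query_pos` after `num_layers` layers.

    Assumption: in every layer, token i reads positions i-window+1 .. i
    of the layer below (never anything before position 0).
    """
    reach = {query_pos}
    for _ in range(num_layers):
        reach = {
            j
            for i in reach
            for j in range(max(0, i - window + 1), i + 1)
        }
    return reach


W, QUERY = 4, 63
spans = [len(reachable_inputs(W, k, QUERY)) for k in (1, 2, 3)]
print(spans)  # [4, 7, 10]: grows by W - 1 per extra layer

# The rounded estimate quoted for Mistral 7B: window times number of layers.
mistral_span = 4096 * 32
print(mistral_span)  # 131072, i.e. about 131K tokens
```

The simulation only shows which positions can be reached. It says nothing about how strongly a distant token's information survives the intermediate layers, which is why the span is called theoretical.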

3. KV Cache Savings: Rolling Buffer Cache

A fixed attention span translates directly into a memory benefit. Each layer needs only the KV entries of the most recent W tokens, so the KV cache can be capped at size W.

Figure 3. The rolling buffer, and a memory comparison (full attention grows as O(n); SWA stays bounded)

Rolling buffer concept: a circular buffer of size W is used.

Essentials of the Rolling Buffer Cache

  • Rolling buffer: KV entries are stored in a buffer of size W, and once it is full the oldest entry is overwritten. The KV cache size is constant (bounded) regardless of sequence length, so memory use is predictable and out-of-memory (OOM) failures from cache growth are avoided.

  • Mistral's reported numbers: at sequence length 8192, cache memory is halved with no loss in quality. Compute is also linear at O(n×W), and with FlashAttention and xFormers optimizations a 2x speedup is reported at sequence length 16K with W=4K.

  • Caveat: index management in a rolling buffer is tricky, so some inference techniques such as speculative decoding need adjustments to work with it. SWA sits on the 'token' axis of KV reduction, but unlike StreamingLLM it is built into the architecture rather than applied as an inference trick.
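The 50% figure at sequence length 8192 follows from simple arithmetic once the cache is capped at W=4096. The sketch below makes that arithmetic explicit. The model shape it uses (8 KV heads, head dimension 128, 2-byte values) is an assumption chosen to resemble a Mistral-7B-like configuration and is not stated in this article; only the 32 layers, W=4096, and the sequence length come from the text above.

```python
def kv_cache_bytes(cached_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed for keys plus values across all layers.

    The leading factor 2 accounts for storing both K and V.
    """
    return 2 * n_layers * n_kv_heads * head_dim * cached_tokens * bytes_per_value


# Assumed shape (hypothetical, Mistral-7B-like): 32 layers, 8 KV heads, head_dim 128, fp16.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
SEQ_LEN, W = 8192, 4096

full_bytes = kv_cache_bytes(SEQ_LEN, N_LAYERS, N_KV_HEADS, HEAD_DIM)
swa_bytes = kv_cache_bytes(min(SEQ_LEN, W), N_LAYERS, N_KV_HEADS, HEAD_DIM)
saving = 1 - swa_bytes / full_bytes

print(full_bytes / 2**30)  # 1.0 GiB per sequence with an uncapped cache
print(swa_bytes / 2**30)   # 0.5 GiB once the cache is capped at W
print(saving)              # 0.5
```

The saving depends only on the ratio of W to the sequence length, so it grows as sequences get longer and disappears entirely for sequences shorter than W.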

4. Variants and Models: Combining Local and Global

Pure local attention gives only weak direct access to distant tokens, so real models either mix in global attention or alternate local and global layers.

Figure 4. Pattern variants (window + global + dilated + random), interleaving, and adopting models

Pattern Variants and Interleaving

  • Longformer: combines a local window, a dilated window (which looks wider by skipping positions), and global attention on selected tokens (aimed at long documents).

  • BigBird: approximates full attention with a combination of local window, global, and random attention.

  • Interleaving (Gemma 2): local SWA layers and global full-attention layers alternate every other layer (half and half). Every second layer has full coverage, so half of the layers see the entire context, which preserves long-context quality, while the compute and cache of the other (local) half stay bounded by W.
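To make the pattern variants concrete, the following sketch builds a boolean mask that combines a local window with global tokens and optional random links. It is a simplified illustration, not the Longformer or BigBird implementation: the window is symmetric (encoder-style), dilation is left out, random links are drawn per token rather than per block, and the function and parameter names are invented for this example.

```python
import random


def hybrid_mask(n, window, global_tokens=(), num_random=0, seed=0):
    """Boolean n x n mask combining three sparse attention patterns.

    - window: token i sees j when |i - j| <= window // 2 (symmetric)
    - global: tokens in `global_tokens` see everyone and are seen by everyone
    - random: every row additionally gets `num_random` randomly chosen keys
    """
    rng = random.Random(seed)
    half = window // 2
    mask = [[abs(i - j) <= half for j in range(n)] for i in range(n)]
    for g in global_tokens:
        for k in range(n):
            mask[g][k] = True  # the global token attends to every position
            mask[k][g] = True  # every position attends to the global token
    for i in range(n):
        for j in rng.sample(range(n), num_random):
            mask[i][j] = True
    return mask


def density(mask):
    """Fraction of (query, key) pairs that are allowed."""
    n = len(mask)
    return sum(sum(row) for row in mask) / (n * n)


window_global = hybrid_mask(12, 4, global_tokens=(0,))               # window + global
window_global_random = hybrid_mask(12, 4, global_tokens=(0,), num_random=2)  # plus random

print(density(window_global), density(window_global_random))
```

In the window-plus-global mask the last token cannot see position 5 directly, but it can see the global token at position 0, which gives every pair of tokens a two-hop path.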

Adopting Models

The trend is a hybrid of many local layers plus a few global ones: most layers use efficient SWA and a minority use global attention. This trade-off between efficiency and long-range recall has become the de facto standard.
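How much cache a hybrid stack saves depends on the local-to-global ratio, which a short calculation makes visible. The sketch below counts cached tokens per layer for a hypothetical 8-layer stack. The layer count, the ordering (local layers first in each group), and the ratios are illustrative assumptions and do not describe any specific released model.

```python
def layer_pattern(num_layers, locals_per_global):
    """Label each layer 'local' or 'global'.

    Assumption: each group is `locals_per_global` local layers followed
    by one global layer.
    """
    period = locals_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local" for i in range(num_layers)]


def cached_tokens_per_layer(pattern, seq_len, window):
    """Global layers cache the whole sequence; local layers cache at most `window`."""
    return [seq_len if kind == "global" else min(seq_len, window) for kind in pattern]


SEQ_LEN, W, LAYERS = 8192, 4096, 8
full_total = LAYERS * SEQ_LEN

alternating = layer_pattern(LAYERS, locals_per_global=1)   # half local, half global
mostly_local = layer_pattern(LAYERS, locals_per_global=3)  # three local per global

alt_ratio = sum(cached_tokens_per_layer(alternating, SEQ_LEN, W)) / full_total
local_ratio = sum(cached_tokens_per_layer(mostly_local, SEQ_LEN, W)) / full_total

print(alternating)
print(alt_ratio)    # 0.75 of the full-attention cache
print(local_ratio)  # 0.625 of the full-attention cache
```

With these numbers an alternating stack keeps 75% of the full cache, since only the local half is capped. The global layers still grow with sequence length, so the relative saving improves for longer inputs and for stacks with more local layers per global layer.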

5. SWA vs StreamingLLM, and Summary

Figure 5. Differences between SWA and StreamingLLM, the trade-offs, and where SWA fits

SWA vs StreamingLLM: A Frequently Confused Distinction

The important point is that the 'failure of window attention' demonstrated by StreamingLLM is the case where a model trained with full attention simply has its cache truncated at inference time. A model trained with SWA from the start (Mistral) has learned the local pattern and works normally (although it can still be reinforced with attention sinks). In other words, the two are not mutually exclusive. They operate at different levels.
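The difference in which KV entries survive can be shown side by side. The sketch below lists the positions retained by a plain sliding window and by a StreamingLLM-style policy that also keeps a few initial 'attention sink' tokens. It only models cache contents, not training. The default of four sink tokens and the convention that `window` counts recent tokens in addition to the sinks are assumptions made for illustration.

```python
def swa_kept_positions(seq_len, window):
    """Positions whose KV a sliding-window layer still holds."""
    return list(range(max(0, seq_len - window), seq_len))


def streaming_kept_positions(seq_len, window, num_sinks=4):
    """StreamingLLM-style retention: initial sink tokens plus recent tokens.

    Assumption: `window` is the number of recent tokens kept in
    addition to the `num_sinks` sink tokens.
    """
    sinks = list(range(min(num_sinks, seq_len)))
    recent_start = max(len(sinks), seq_len - window)
    return sinks + list(range(recent_start, seq_len))


swa_kept = swa_kept_positions(20, 6)
streaming_kept = streaming_kept_positions(20, 6)

print(swa_kept)        # [14, 15, 16, 17, 18, 19]
print(streaming_kept)  # [0, 1, 2, 3, 14, 15, 16, 17, 18, 19]
```

Both caches are bounded. What differs is where the policy lives: the first list reflects what an SWA-trained layer expects to see, while the second is a retention rule applied at inference time to a model that was trained with full attention.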

Pros and Cons

  • Pros: compute is linear and the KV cache is bounded, so long sequences are handled efficiently and memory use is predictable.

  • Cons: direct access to distant tokens is weak (it relies on indirect paths). Precise long-range recall suffers, which is compensated for by measures such as interleaving global layers (Gemma).

Related Technologies

  • Longformer: The Long-Document Transformer (https://arxiv.org/abs/2004.05150)
  • Big Bird: Transformers for Longer Sequences (https://arxiv.org/abs/2007.14062)
  • Mistral 7B (https://arxiv.org/abs/2310.06825)
  • Gemma 2: Improving Open Language Models at a Practical Size (https://arxiv.org/abs/2408.00118)
  • StreamingLLM Analysis (llm_0055_streamingllm_analysis.html)
  • KV Cache Offloading Analysis (llm_0040_kv_cache_offloading_analysis.html)
  • Memory Centric LLM Serving Survey (llm_0003_memory_centric_llm_serving_survey.html)
  • Speculative Decoding Analysis (llm_0070_speculative_decoding_analysis.html)

Key Takeaways

Sliding Window Attention is a local-attention architecture in which each token attends only to the most recent W tokens, lowering compute to O(n×W) and capping the KV cache at W (bounded). The key insight is that stacking layers grows the receptive field to about W×k (as in a CNN), so information propagates far even with local attention. With a rolling buffer cache holding the KV at W, Mistral reports a 50% cache-memory reduction at sequence length 8192 and a 2x speedup at 16K with W=4K. SWA (architecture, training time) operates at a different level from StreamingLLM (inference-time KV management), and hybrid variants such as Longformer, BigBird, and Gemma compensate for weak long-range access. SWA is the architectural form of the 'token' axis of KV reduction and is orthogonal to quantization (bits), GQA/MLA (heads/compression), and PagedAttention (waste).

Memory-systems perspective (research extension): the biggest systems implication of SWA is that KV cache size is constant (bounded) regardless of sequence length. Memory does not blow up on long inputs, so it stays predictable, and in interleaved models the footprint difference between local and global layers can drive per-layer placement, offloading, and tiering decisions. As with StreamingLLM (eviction), the same trade-off applies between eviction/localization and offloading: discard what can be discarded, but decide where to keep what may still be needed.

Note: the figures in this article are values reported in the original papers and technical materials (Mistral 7B, arXiv 2310.06825; Longformer, arXiv 2004.05150; BigBird, arXiv 2007.14062; Gemma 2, arXiv 2408.00118; and others). 'W=4096', '131K tokens', '50% cache reduction', '2x speedup', and 'Gemma local 4096 / global 8192' are reported for specific models and conditions and should be generalized with care. The best window size and local-to-global ratio depend on the model and workload, and whether SWA is adopted at all changes between model generations (some models later dropped SWA). The receptive field is a theoretical range and may differ from how much long-range context is actually used.

Rolling buffer concept (pseudocode)

slot = position % W  # position is the absolute token index; W is the window size
cache[slot] = (key, value)  # once the buffer is full, this overwrites the oldest entry
attn = attention(query, cache[recent_W_slots])  # attend only over the most recent W KV pairs
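The pseudocode above can be turned into a small runnable class. This is a minimal sketch of a rolling buffer for a single layer and a single sequence. It stores opaque key and value objects rather than tensors and omits batching, prefill chunking, and the attention computation itself. The class and method names are invented for illustration and are not taken from the Mistral codebase.

```python
class RollingKVCache:
    """Fixed-size circular buffer holding the most recent `window` KV pairs."""

    def __init__(self, window):
        self.window = window
        self.slots = [None] * window  # physical storage, reused cyclically
        self.next_position = 0        # absolute index of the next token

    def append(self, key, value):
        """Store the KV pair of the next token, overwriting the oldest if full."""
        slot = self.next_position % self.window
        self.slots[slot] = (self.next_position, key, value)
        self.next_position += 1

    def recent(self):
        """Cached (position, key, value) entries in temporal order, oldest first."""
        start = max(0, self.next_position - self.window)
        return [self.slots[p % self.window] for p in range(start, self.next_position)]


cache = RollingKVCache(window=4)
for pos in range(6):
    cache.append(f"k{pos}", f"v{pos}")

positions = [p for p, _, _ in cache.recent()]
print(positions)          # [2, 3, 4, 5]: tokens 0 and 1 were overwritten
print(cache.slots[0][0])  # 4: physical slot 0 now holds token 4
```

Physical slot order and temporal order diverge once the buffer wraps, which is the index bookkeeping that the caveat about speculative decoding refers to: rolling back rejected tokens cannot restore entries that were already overwritten.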
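Table: adopting models and their attention patterns

Model | Approach
Longformer | window + dilated + global (long documents)
BigBird | window + global + random
Mistral 7B | SWA in every layer (W=4096) + GQA + rolling buffer cache
Gemma 2 / 3 | local SWA ↔ global full, interleaved (Gemma 2: local 4096, global 8192, half and half; Gemma 3 uses more local than global layers)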
Table: SWA vs StreamingLLM

Aspect | Sliding Window Attention | StreamingLLM
Nature | Architecture (built in at training time) | Inference-time KV management technique
Application | The model is trained with the local pattern | Sinks + window applied to an existing model
Training | Requires pretraining with the pattern | Training-free (no retraining needed)
Examples | Mistral, Gemma | Applied to existing full-attention models