Ryotta's Basic

Speculative Decoding Analysis

An In-Depth Analysis of Speculative Decoding

Draft Model · Verification · Acceptance Rate · Medusa/EAGLE · Throughput vs TTFT

Speculative decoding is a technique in which a small draft model proposes several tokens first and a large target model verifies them in a single pass, which reduces the number of autoregressive decoding steps. It does not eliminate the computation, but when acceptance is high enough it compresses the target model's expensive one-forward-pass-per-token loop.

The approach focuses on widening the set of parallel candidates while preserving the target model's output distribution. Actual performance therefore depends on how close the draft and target distributions are, and on how cheaply the serving stack handles the accept/reject boundary.
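As a minimal sketch of why the output distribution can be preserved, the standard speculative sampling acceptance rule can be written for a single drafted token. The function name `verify_token` and the dict-based distributions are illustrative assumptions, not part of any serving library.

```python
import random


def verify_token(token, p_target, q_draft, rng):
    """Accept or replace one drafted token (illustrative helper).

    p_target and q_draft map token -> probability at this position.
    Returns (accepted, token_out).
    """
    p = p_target.get(token, 0.0)
    q = q_draft[token]  # the draft sampled this token, so q > 0
    # Accept with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return True, token
    # Rejected: resample from the residual max(0, p - q), renormalized.
    residual = {t: max(0.0, p_target[t] - q_draft.get(t, 0.0)) for t in p_target}
    total = sum(residual.values())
    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return False, rng.choices(tokens, weights=weights, k=1)[0]
```

If tokens are drawn from `q_draft` and passed through this rule, the emitted tokens follow `p_target`, which is the sense in which the method is lossless under sampling.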

1. Basic idea: advancing several tokens at once

In ordinary decoding, the target model needs one decode step for every token it emits. In speculative decoding, a smaller draft model first proposes k candidate tokens quickly, and the target model verifies that candidate span in parallel.

When a long run of leading candidates matches, decoding advances by that many tokens at once. The speedup therefore depends on two factors together: how cheaply the draft can produce many candidates, and how often those candidates are accepted.
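A rough way to quantify "how often the candidates match" is the expected number of tokens emitted per target pass. The sketch below assumes each drafted token is accepted independently with the same probability `alpha`, which is a simplification; the function name is illustrative.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target verification pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha, and that the target pass always contributes one
    token of its own (the correction at the first rejection, or a bonus
    token when all k drafts pass).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
```

With `alpha = 0.8` and `k = 4` this gives about 3.36 tokens per target pass, against exactly 1 for ordinary decoding.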

Figure 1. The concept: draft propose -> target verify -> multi-token advance

2. Key concepts

  • draft model: a small model that produces candidate tokens quickly.
  • target model: the large model that is responsible for the final output and acts as the verifier.
  • speculative window: the number of candidate tokens k proposed at once.
  • acceptance rate: the fraction of proposed candidates that are actually accepted.
  • rollback: discarding the candidates after a mismatch and restoring the KV cache and sampling state to the accepted boundary.
  • shared prefix KV: the accepted prefix is reused, and a new path is created only at the branch point.
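To make the acceptance rate concrete, here is a small sketch that computes it from per-step records. The `StepRecord` type and `acceptance_stats` helper are hypothetical names for illustration; a real serving stack would expose its own counters.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StepRecord:
    proposed: int  # number of drafted tokens in this step (the window k)
    accepted: int  # length of the accepted prefix, between 0 and proposed


def acceptance_stats(records: List[StepRecord]) -> Tuple[float, float]:
    """Return (acceptance rate, mean accepted tokens per step)."""
    proposed = sum(r.proposed for r in records)
    accepted = sum(r.accepted for r in records)
    rate = accepted / proposed if proposed else 0.0
    mean_len = accepted / len(records) if records else 0.0
    return rate, mean_len
```

For example, four steps with k = 4 that accept 4, 1, 0, and 3 tokens give an acceptance rate of 0.5 and a mean accepted length of 2.0.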

3. Comparison and analysis

| Method | Proposer | Extra model / training | KV structure | Advantage | Caveat |
|---|---|---|---|---|---|
| Independent draft model | Separate draft model | Usually used without extra training | Draft and target kept separate | Straightforward to implement | Two models to operate, with extra memory cost |
| Self-speculation | Shallow layers of the same model | Mostly structural adjustment | Partially shared | Lower deployment burden | Needs early-exit and quality tuning |
| Medusa / EAGLE | Extra heads or a feature predictor | Some fine-tuning, often light | Shared backbone | Produces parallel candidates without a separate draft model | Training recipe and acceptance tuning matter |

Instead of a separate draft model, Medusa attaches multiple decoding heads on top of the backbone and verifies the candidates simultaneously with tree attention. Medusa-1 targets lossless acceleration on a frozen backbone, while Medusa-2 fine-tunes the backbone together with the heads to reach a higher speedup. EAGLE autoregresses at the feature level to reduce uncertainty and stabilize speculative sampling. Medusa also proposes typical acceptance and self-distillation, which address the acceptance rate and the availability of training data.
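Tree attention can be illustrated with the mask it needs: each candidate token may attend only to itself and its ancestors in the candidate tree (plus the committed prefix, which is omitted here). This is a simplified sketch with an assumed `parents` array encoding, not Medusa's actual implementation.

```python
from typing import List


def tree_attention_mask(parents: List[int]) -> List[List[bool]]:
    """Build an ancestor mask over a candidate tree.

    parents[i] is the index of candidate i's parent, or -1 when the parent
    is the already committed prefix. Each parent index is assumed to be
    smaller than its child's index. mask[i][j] is True when candidate i
    may attend to candidate j.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root, marking self and ancestors
            mask[i][j] = True
            j = parents[j]
    return mask
```

With `parents = [-1, -1, 0, 0, 1]`, candidates 2 and 3 both continue candidate 0 and candidate 4 continues candidate 1, so all five can be scored in one forward pass without sibling branches seeing each other.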

4. How it works: accept/reject and rollback

The typical flow is as follows.

  1. The draft model generates k candidate tokens.
  2. The target model verifies the whole candidate sequence in a single pass.
  3. The matching prefix is accepted immediately.
  4. Tokens after the first mismatch are discarded.
  5. If needed, a fallback decode restarts generation from the next token.
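The steps above can be sketched for the greedy case, where "match" means the drafted token equals the target's argmax. The toy models and function names below are assumptions for illustration. A real system scores all positions in one batched target pass rather than calling the target per position, and in this sketch the fallback token is simply the target's own prediction at the mismatch position.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]  # greedy next-token function


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Run one propose/verify round and return the newly emitted tokens."""
    # Step 1: the draft proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # Steps 2-4: verify left to right, stop at the first mismatch.
    emitted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if expected != tok:
            emitted.append(expected)  # step 5: the target's token replaces it
            return emitted
        emitted.append(tok)
        ctx.append(tok)
    emitted.append(target_next(ctx))  # all accepted: one bonus target token
    return emitted


def generate(prefix, draft_next, target_next, k, max_new_tokens):
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(prefix) + max_new_tokens]


# Toy models: the target counts up modulo 10; the draft is wrong after a 4.
def toy_target(ctx):
    return (ctx[-1] + 1) % 10


def toy_draft(ctx):
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

Because every emitted token is either confirmed or produced by the target, the output of `generate` is identical to plain greedy decoding with the target alone; only the number of target rounds changes.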

Figure 2. The difference between a long run of accepts and a reject in the middle of the window

Rollback and Sampling

A real system has to separate the accepted prefix from the rejected suffix exactly, and must keep or discard sampling state and KV cache entries along that same boundary. Speculative decoding is therefore not just a matter of attaching two models to each other; it also requires accept-aware KV management.
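A minimal sketch of accept-aware KV handling follows, assuming a toy cache with one entry per token position. The `KVCache` class and its method names are hypothetical; production systems manage paged blocks on the accelerator rather than Python lists.

```python
class KVCache:
    """Toy per-sequence KV cache: one entry per token position."""

    def __init__(self):
        self.entries = []
        self.committed = 0  # positions known to belong to the accepted prefix

    def append_speculative(self, kv_entries):
        """Add entries for tokens that have not been verified yet."""
        self.entries.extend(kv_entries)

    def commit(self, n_accepted):
        """Keep the first n_accepted speculative entries and drop the rest."""
        speculative = len(self.entries) - self.committed
        if not 0 <= n_accepted <= speculative:
            raise ValueError("n_accepted is outside the speculative range")
        self.committed += n_accepted
        del self.entries[self.committed:]  # rollback of the rejected suffix
```

After `append_speculative` with three drafted entries and `commit(1)`, only the first drafted entry survives and the cache length again equals the accepted sequence length.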

5. Pros and cons

| Pros | Cons |
|---|---|
| Fewer target steps can lower latency | The gain shrinks when the draft is expensive |
| Cumulative gains tend to grow on long responses | A low acceptance rate can make it a net loss |
| Existing models can be used as they are | Running draft and target together complicates KV management |
| Variants such as Medusa/EAGLE can reduce the operational burden | Performance varies more when the workload distribution shifts |

When the draft model is small enough, its distribution is close to the target's, and the acceptance rate is high, speculative decoding cuts the number of target steps substantially. The benefit is largest when generating long responses.

Conversely, when the draft itself is expensive or the acceptance rate is low, the draft cost adds to the verification cost and the result can be slower than plain decoding. Speculative decoding is not free acceleration; it is a conditional optimization that pays off only when the distributions match.

Figure 3. Gain and loss regions as a function of draft cost, acceptance rate, and response length

6. Main variants

The most intuitive approach is to run an independent small draft model. Because that increases memory use and operational complexity, self-speculation, Medusa, and EAGLE, which reuse the shallow layers of the same model or add extra heads, are also widely studied.

What these variants share is the goal of making proposals cheaper and reusing as much of the verification work as possible. They differ in how much training they need, in their acceptance characteristics, and in how KV is shared, so their trade-offs for serving differ as well.

Figure 4. Comparison of the two-model, self-speculation, and multi-head approaches

7. Related techniques

In practice, speculative decoding has to run together with the batch scheduler. Some requests are drafting, some are being verified, and some are running a fallback decode after a reject, so the state of the active set becomes more complex.

To raise throughput, the draft executor, verifier executor, KV manager, and sampler therefore need to be tied into a single pipeline. Because verify and fallback have different latency characteristics, they can either conflict with or reinforce the continuous batching policy.
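One way to picture the extra scheduler state is a per-request phase plus a grouping step that forms one batch per phase. Everything below (the `Phase` enum, `Request`, and the transition rule in `advance`) is a hypothetical sketch of that bookkeeping, not the design of any particular serving engine.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum, auto
from typing import Dict, List


class Phase(Enum):
    DRAFT = auto()
    VERIFY = auto()
    FALLBACK = auto()


@dataclass
class Request:
    req_id: str
    phase: Phase


def group_by_phase(active: List[Request]) -> Dict[Phase, List[str]]:
    """Group active requests so each phase can be batched separately."""
    batches = defaultdict(list)
    for req in active:
        batches[req.phase].append(req.req_id)
    return dict(batches)


def advance(req: Request, all_accepted: bool) -> None:
    """Assumed transition rule: draft -> verify -> (draft | fallback) -> draft."""
    if req.phase is Phase.DRAFT:
        req.phase = Phase.VERIFY
    elif req.phase is Phase.VERIFY:
        req.phase = Phase.DRAFT if all_accepted else Phase.FALLBACK
    else:
        req.phase = Phase.DRAFT
```

A continuous batching policy would then decide, at each iteration, which of these per-phase groups to run and how to interleave them.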

Figure 5. A serving stack that includes speculative decoding

| Resource | Key point |
|---|---|
| Continuous Batching Analysis | The basis for batch scheduling that includes verify/fallback |
| PagedAttention Analysis | KV block management that stores the accepted prefix efficiently |
| KV Cache Offloading Analysis | Extends the KV budget and complements long-context handling |
| LLM Inference Scheduler Analysis | A higher-level view that treats queueing, backpressure, and admission together |
| Accelerating Large Language Model Decoding with Speculative Sampling | Original speculative sampling paper; reports a 2x-2.5x speedup on Chinchilla 70B (the 2x-3x figure on T5X comes from the concurrent paper "Fast Inference from Transformers via Speculative Decoding") |
| Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads | Multiple decoding heads, Medusa-1/2; reports 2.2x-3.6x |
| EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty | Feature-level speculative sampling; reports 2.7x-3.5x |

8. Key takeaways

The essence of speculative decoding is to use a small model's guesses to reduce the number of steps the large model has to take. The actual benefit is a function of the acceptance rate and the draft cost, and it varies widely with the model and the workload.

For that reason it is better understood as a systems technique than as a single algorithm: the draft design, verification kernels, KV management, and batch scheduler have to be optimized together for it to be worthwhile.

The speedup turns into a practical gain only when draft cost, acceptance rate, fallback handling, and scheduler/KV management all fit together. In real serving, the more important question is which speculative window and verify path to place on top of which batching policy, rather than how individual algorithms compare in isolation.