Ryotta's Basic

Continuous Batching Analysis

Continuous Batching: An In-Depth Analysis

Dynamic Scheduler · Decode Slot Refill · Chunked Prefill · PagedAttention · Fairness vs Throughput

Continuous batching is a dynamic scheduling technique: instead of fixing a decode batch once and holding it until every request in it finishes, the scheduler refills each freed slot with a new request as soon as a sequence finishes at a token step. Because decoding in LLM serving is memory-bound, this simple policy difference changes GPU utilization and throughput substantially, and it has become the default operating mode of vLLM-style engines.

ORCA addressed the same problem first with iteration-level scheduling and selective batching, vLLM uses chunked prefill to tune its decode-first policy, and TensorRT-LLM productizes the idea as in-flight batching with chunked context. This document proceeds in the following order: the waste in static batching → the continuous batching loop → prefill/decode interference → comparison and analysis → pros and cons → related techniques.

In production, a well-designed scheduler alone is not enough. Perceived performance depends on how quickly freed slots are refilled, how well the KV cache is turned over, how finely long prefills are split so they can coexist with decode, and how fairness is maintained among requests with different SLAs.

1. Why static batching cannot keep the decoding GPU full

In decoding, each sequence advances by only one token per step, so some requests finish early while others linger. Static batching forms a batch once and holds it until the whole batch finishes, so the slot vacated by a short request stays empty until the batch ends.

Each empty slot is wasted GPU memory bandwidth. When request lengths vary widely, as in online serving, a single tail request can hold the entire batch for a long time and drag down GPU utilization.

Figure 1. Slot reuse in static batching versus continuous batching
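To make the slot-waste argument concrete, here is a small simulation sketch. It assumes every request's output length is known up front, ignores prefill cost, and treats one decode step as one time unit. The function names are illustrative and not taken from any engine.

```python
from collections import deque

def static_batching_steps(lengths, slots):
    """Decode steps and slot utilization when each batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])  # the tail request pins the whole batch
    return steps, sum(lengths) / (steps * slots)

def continuous_batching_steps(lengths, slots):
    """Same workload, but a finished slot is refilled before the next step."""
    queue = deque(lengths)
    active = []  # remaining tokens per active request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())      # slot refill
        active = [r - 1 for r in active]        # one decode step: one token per sequence
        active = [r for r in active if r > 0]   # drop finished requests
        steps += 1
    return steps, sum(lengths) / (steps * slots)

lengths = [2, 2, 2, 8, 2, 2, 2, 8]  # output lengths in tokens (toy data)
print(static_batching_steps(lengths, slots=4))
print(continuous_batching_steps(lengths, slots=4))
```

With these toy lengths, static batching needs 16 steps at about 44% slot utilization, while refilling slots finishes the same work in 12 steps at about 58%.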

Decode is memory-bound

A decode step reads the model weights and the already computed KV cache to produce just one next token per sequence, so memory traffic, not FLOPs, is the bottleneck. Keeping more sequences in the active set at the same time therefore translates almost directly into throughput.
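A rough back-of-the-envelope estimate shows why a larger active set helps. The sketch below counts only weight traffic and assumes about 2 FLOPs per parameter per sequence and 2-byte (FP16/BF16) weights. It deliberately ignores KV cache reads and attention FLOPs, so treat it as an illustration rather than a performance model.

```python
def decode_arithmetic_intensity(batch_size, bytes_per_param=2):
    """FLOPs per byte of weight traffic for one decode step (weights only).

    Each parameter contributes about 2 FLOPs (multiply + add) per sequence,
    while the weights are read from memory once per step regardless of batch size.
    """
    flops_per_param = 2 * batch_size
    return flops_per_param / bytes_per_param

for b in (1, 8, 64):
    print(b, decode_arithmetic_intensity(b))
```

Under these assumptions the intensity grows linearly with the number of active sequences, which is why an empty slot is a direct loss: the weights are streamed from memory either way.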

2. Key concepts

  • request / sequence: from the server's point of view, one client request is one sequence.
  • ready queue: the set of requests that have not run yet or are waiting for their next turn.
  • active set: the set of requests taking part in the current decode step.
  • slot refill: immediately placing a new request into the slot left by a finished request.
  • token budget: the cap on the number of tokens that can go into one iteration (max_num_batched_tokens in vLLM, max_num_tokens in TensorRT-LLM).
  • KV budget: the memory headroom available for accepting new requests. In the end, admission is closer to a memory problem than a compute problem.
  • fairness: if short requests are always favored, long requests get pushed back; if large requests are kept running, TTFT for short requests gets worse.
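The budgets above can be combined into a single admission check. The following is a minimal sketch under stated assumptions: `Budgets`, `can_admit`, and the 16-token block size are hypothetical names and values chosen for illustration, not any engine's API.

```python
import math
from dataclasses import dataclass

@dataclass
class Budgets:
    max_active_seqs: int       # hypothetical cap on the active-set size
    max_tokens_per_iter: int   # token budget for one iteration
    free_kv_blocks: int        # KV budget, counted in blocks
    block_size: int = 16       # tokens per KV block (assumed value)

def can_admit(prompt_len, active_seqs, tokens_scheduled, b):
    """Return True if a new request fits all three budgets for this iteration."""
    blocks_needed = math.ceil(prompt_len / b.block_size)
    return (
        active_seqs < b.max_active_seqs
        and tokens_scheduled + prompt_len <= b.max_tokens_per_iter
        and blocks_needed <= b.free_kv_blocks
    )

budgets = Budgets(max_active_seqs=8, max_tokens_per_iter=2048, free_kv_blocks=10)
print(can_admit(100, active_seqs=3, tokens_scheduled=500, b=budgets))  # fits every budget
print(can_admit(200, active_seqs=3, tokens_scheduled=500, b=budgets))  # rejected by the KV budget
```

The second call is the interesting one: the slot and token budgets both have room, yet the request is rejected because it needs 13 KV blocks and only 10 are free.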

3. The core loop of continuous batching

After each decode step, continuous batching removes finished requests and pulls as many new requests from the ready queue as there are free slots to replenish the active set. In other words, the batch is rebuilt per token step rather than per session.

At every iteration the scheduler has to decide which requests to keep and which to admit, whether to run a prefill now or defer it to the next turn, and how to split the active sequence budget and the KV cache budget.

  1. Run the decode for the current iteration.
  2. Remove finished requests immediately.
  3. Compute the remaining budget and pick new requests from the ready queue.
  4. If needed, split prefill into chunks and interleave them with decode.
  5. Repeat the same process in the next iteration.

Figure 2. The scheduler loop: ready queue -> active set -> finished slot refill
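The numbered steps above can be sketched as a toy loop. This is a simplified illustration, not an engine implementation: `Request`, `decode_step`, and `run` are hypothetical names, the model forward pass is replaced by a counter, and prefill (step 4) is treated as free.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    max_new_tokens: int
    generated: int = 0

    @property
    def finished(self):
        return self.generated >= self.max_new_tokens

def decode_step(active):
    """Stand-in for one forward pass: every active request gains one token."""
    for req in active:
        req.generated += 1

def run(ready, max_active):
    """Return [(request_id, iteration_at_which_it_finished), ...]."""
    active, finish_order = [], []
    iteration = 0
    while ready or active:
        # Step 3: admit new requests into free slots.
        while ready and len(active) < max_active:
            active.append(ready.popleft())
        # Step 1: one decode iteration for the whole active set.
        decode_step(active)
        iteration += 1
        # Step 2: drop finished requests right away so their slots free up.
        for req in active:
            if req.finished:
                finish_order.append((req.rid, iteration))
        active = [r for r in active if not r.finished]
    return finish_order

ready = deque([Request("a", 3), Request("b", 1), Request("c", 2)])
print(run(ready, max_active=2))
```

Request "c" enters at the second iteration, in the slot that "b" vacated after the first, without waiting for "a" to finish.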

Queue and Active Set

The core state consists of the ready queue, the active decode set, and the pool of free slots and KV blocks. Admitting a request is ultimately tied not to whether a compute slot is free but to whether its KV cache can be accommodated, so the batch scheduler cannot be separated from the memory manager.

4. The problem with mixing prefill and decode

Prefill of a long prompt is compute-intensive, while decode is memory-intensive. Mixing the two in one batch can improve TTFT, but a large prefill can disturb the inter-token latency (ITL) of ongoing decodes.

That is why recent engines use chunked prefill, which splits a long prefill into several chunks and interleaves them with decode steps. It is a compromise: rather than maximizing throughput alone, it keeps latency under control as well.

Chunked Prefill / Chunked Context

  • vLLM uses chunked prefill to split a large prefill into several chunks and place them in the same batch as decode.
  • TensorRT-LLM uses chunked context to split the context phase across several iterations and process it together with the generation phase.
  • Both approaches reduce the problem of a long prompt occupying the whole batch for a long time, and they stabilize ITL under mixed workloads.

Figure 3. Giant prefill versus chunked prefill
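One way to picture chunked prefill is as a per-iteration token budget that decodes spend first, with prefill chunks filling whatever is left. The sketch below assumes that simplified decode-first policy. `plan_iteration` and its arguments are hypothetical and do not reproduce the scheduler code of vLLM or TensorRT-LLM.

```python
def plan_iteration(num_decodes, pending_prefills, token_budget):
    """Plan one iteration: decodes first, then prefill chunks in the leftover budget.

    pending_prefills: list of [request_id, remaining_prompt_tokens], mutated in place.
    Returns (decode_tokens, [(request_id, chunk_len), ...]).
    """
    decode_tokens = min(num_decodes, token_budget)  # one token per decoding sequence
    left = token_budget - decode_tokens
    chunks = []
    for item in pending_prefills:
        if left == 0:
            break
        chunk = min(item[1], left)
        if chunk > 0:
            chunks.append((item[0], chunk))
            item[1] -= chunk
            left -= chunk
    return decode_tokens, chunks

pending = [["long", 1000]]  # a 1000-token prompt waiting for prefill
iterations = 0
while pending[0][1] > 0:
    print(plan_iteration(num_decodes=24, pending_prefills=pending, token_budget=256))
    iterations += 1
print("iterations:", iterations)
```

In this toy run the 1000-token prompt is spread over five iterations of at most 232 prefill tokens each, and the 24 decoding sequences still emit a token in every one of them instead of stalling behind a single giant prefill.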

5. Comparison and analysis

Continuous batching generally raises throughput (tokens/s) and GPU utilization substantially, but the price is scheduling complexity and fairness problems. If short requests are favored too often, long requests starve; conversely, if the scheduler waits in order to keep the batch large, TTFT gets worse.

In other words, continuous batching is not a simple batching trick but a multi-objective control problem across throughput, TTFT, ITL, and tail latency. This is why complementary techniques such as admission control, chunked prefill, and prefix caching are needed alongside it.
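As one illustration of trading fairness against short-request latency, the sketch below picks the shortest prompt first but credits each request for the time it has waited, so long requests cannot starve indefinitely. The scoring rule and the `aging_weight` knob are made up for this example and are not a policy from ORCA, vLLM, or TensorRT-LLM.

```python
def pick_next(waiting, now, aging_weight=10.0):
    """Choose the next request: shortest prompt first, with credit for time spent waiting.

    waiting: list of (request_id, prompt_len, arrival_time_in_seconds).
    The lowest score wins; aging_weight is in tokens per second of waiting.
    """
    def score(item):
        _, prompt_len, arrival = item
        return prompt_len - aging_weight * (now - arrival)
    return min(waiting, key=score)[0]

# Early on, the short request jumps ahead of the long one.
print(pick_next([("long", 500, 0.0), ("short", 50, 9.0)], now=10.0))
# After the long request has waited a full minute, it outranks a fresh short request.
print(pick_next([("long", 500, 0.0), ("short", 50, 58.0)], now=60.0))
```

Raising `aging_weight` moves the policy toward first-come-first-served, while lowering it moves it toward pure shortest-first, which is the same tension described above.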

Method | Scheduling unit | TTFT | ITL / throughput | Characteristics
Static batching | Per request | Queueing time tends to grow | Penalized by heavy slot waste | Simple to implement but vulnerable to tail requests
Continuous batching | Per token iteration | New requests are admitted quickly | High GPU utilization | Slots of finished requests are reused immediately
Continuous batching + chunked prefill | Per token iteration + chunk | More stable with long prompts | Well suited to mixed workloads | Reduces prefill/decode contention

Engine / paper | Batching perspective | Memory perspective | Practical point
ORCA | iteration-level scheduling, selective batching | Tracks per-request state at fine granularity | First to point out, structurally, the inefficiency of static batching in online serving
vLLM | continuous batching, chunked prefill | PagedAttention-based KV block management | Sustains high throughput while tolerating variance in request length
TensorRT-LLM | in-flight batching, chunked context | Productized with the executor and scheduler integrated | Tuning operational parameters (max_batch_size, max_num_tokens) matters

Figure 4. Throughput gains versus the latency/fairness trade-off

6. Pros and cons

Pros | Cons
Reuses empty slots immediately, raising GPU utilization | Scheduler state becomes more complex
Makes it easier to cut waiting time for short requests | Fairness between long and short requests has to be tuned
Combined with PagedAttention, helps reduce KV cache fragmentation | Long prefills can make decode latency jitter
Combined with chunked prefill, handles mixed workloads well | Admission control and memory budget management are mandatory

7. Combining with PagedAttention

For continuous batching to work well, not only request slots but also the KV cache must be recycled at fine granularity. Otherwise, even after a short request leaves, a new request cannot immediately use the contiguous memory fragment it left behind, and external fragmentation results.

PagedAttention solves this by splitting the KV cache into fixed-size blocks, and continuous batching uses that block pool at a high turnover rate. This is why the two techniques are almost always mentioned together in today's LLM serving engines.

Figure 5. How the continuous batching scheduler and the KV memory manager fit together
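The block-pool idea can be sketched with a toy allocator. This is only an illustration of fixed-size blocks plus per-request block tables; `BlockPool` and its methods are hypothetical names, and real engines also handle block growth during decode, sharing, and preemption.

```python
class BlockPool:
    """Toy KV block allocator: fixed-size blocks and per-request block tables."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # ids of free physical blocks
        self.tables = {}                     # request_id -> list of physical block ids

    def allocate(self, rid, num_tokens):
        """Reserve enough blocks for num_tokens; return False if the pool is too small."""
        needed = -(-num_tokens // self.block_size)  # ceiling division
        if needed > len(self.free):
            return False
        self.tables[rid] = [self.free.pop() for _ in range(needed)]
        return True

    def release(self, rid):
        """Return a finished request's blocks to the pool."""
        self.free.extend(self.tables.pop(rid))

pool = BlockPool(num_blocks=6)
pool.allocate("a", 32)   # 2 blocks
pool.allocate("b", 16)   # 1 block
pool.allocate("c", 32)   # 2 blocks
pool.release("a")        # "a" finishes; its blocks go back to the pool
print(pool.allocate("d", 48), sorted(pool.tables["d"]))
```

Request "d" needs three blocks and receives blocks 0, 4, and 5, which are not adjacent. An allocator that required one contiguous region would have to reject it even though three blocks are free.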

This coupled structure carries over directly to higher-level features such as offloading and prefix caching. Continuous batching is therefore more accurately seen as an operational layer running on top of the KV block allocator and the admission policy than as a standalone technique.

8. Related techniques

Resource | Key point
PagedAttention Analysis | The basis for KV block reuse and fragmentation relief
KV Cache Offloading Analysis | A complementary axis that extends the KV budget and supports admission control
Prefix Caching Analysis | Cuts redundant prefill and combines well with continuous batching
Speculative Decoding Analysis | A complementary technique that speeds up token progress in the decode phase
LLM Inference Scheduler Analysis | A higher-level view covering queues, backpressure, and admission together
Orca: A Distributed Serving System for Transformer-Based Generative Models | iteration-level scheduling, selective batching, 36.9x throughput improvement
vLLM Optimization and Tuning | chunked prefill, preemption, tuning max_num_batched_tokens
TensorRT-LLM: Paged Attention, IFB, and Request Scheduling | in-flight batching, chunked context, max_batch_size, max_num_tokens
Efficient Memory Management for Large Language Model Serving with PagedAttention | KV cache fragmentation relief and throughput improvement

Figure 6. How the KV block abstraction extends to offloading and memory tiering

Figure 7. Where the continuous batching scheduler sits in a production serving stack

9. Key takeaways

The essence of continuous batching is not keeping a batch alive for a long time but reusing empty slots immediately at token granularity. Because decoding is memory-bound, this policy has a very large effect on throughput.

In real systems, however, prefill interference, fairness, the active KV budget, and prefix sharing all have to be handled together, so continuous batching is less a one-line scheduler optimization than an operating philosophy for the serving engine. ORCA's iteration-level scheduling, vLLM's chunked prefill, and TensorRT-LLM's in-flight batching are examples of the same problem solved in different engines.

Figure 8. Why static batching leaves the GPU idle because of tail requests

Figure 9. The scheduler loop that rebuilds the ready queue and active set at every iteration

Figure 10. Latency difference between giant prefill and chunked prefill

Figure 11. Metrics that continuous batching improves and the operational costs it introduces

The point is not to raise throughput alone but to raise slot turnover and memory turnover together in decode-heavy workloads. When evaluating continuous batching, look at TTFT, ITL, tail latency, and KV utilization along with tokens/s.
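For reference, these latency metrics can be computed from per-token timestamps. The sketch below is a minimal example: the helper names are illustrative, the timestamps are assumed to be wall-clock seconds for each emitted token, and the percentile uses the simple nearest-rank definition.

```python
def latency_metrics(arrival, token_times):
    """TTFT and mean ITL for one request, from wall-clock timestamps in seconds."""
    ttft = token_times[0] - arrival
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_itl

def p99(values):
    """Nearest-rank 99th percentile, a simple view of tail latency."""
    ordered = sorted(values)
    rank = max(1, -(-99 * len(ordered) // 100))  # ceil(0.99 * n)
    return ordered[rank - 1]

ttft, itl = latency_metrics(arrival=0.0, token_times=[0.5, 0.75, 1.0])
print(ttft, itl)
print(p99([0.1] * 98 + [0.4, 2.0]))
```

Reporting the 99th percentile next to the mean matters here because a handful of requests stuck behind long prefills can leave the average nearly unchanged while the tail grows noticeably.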