Ryotta's Basic

LLM Inference Scheduler Analysis

LLM Inference Scheduler: An In-Depth Analysis

Admission Control · Batching Policy · SLO-Aware Queueing · Fairness · Token Budgeting

An LLM inference scheduler is not a queue manager that merely orders requests. It is a controller that arbitrates, at every step, the conflicts among latency, throughput, fairness, and memory budget. In modern serving engines that combine continuous batching, chunked prefill, a KV block pool, and multi-tenant priorities, the scheduler's decisions determine both user-perceived performance and GPU utilization. This document organizes the problem around target metrics, the scheduling loop, representative policies, backpressure, and cluster-level placement.

The scheduler is also closer to a state machine interlocked with the router, runtime, and memory manager than to an independent queue. It has to decide together which requests to admit now, which to carry into the next token step, which prefills to split into chunks, and which tenants to protect first, so in real implementations policy and memory control are often not separated.

Key concepts

  • Admission control is not a simple entry permit. It is a memory gate that checks the KV budget and the active sequence budget together.
  • Batching policy is the rule that decides in what ratio and order prefill and decode are mixed, and it sets the balance between utilization and latency.
  • Backpressure is what happens when active decode load, prefill bursts, and remote KV delays propagate to upstream queues, and it directly affects operational quality.
  • Fairness and SLOs are often implemented with priority, aging, and deadline-like rules, and they are mandatory constraints in multi-tenant environments.

1. What the scheduler optimizes simultaneously

An inference scheduler differs from a plain FCFS queue in that it watches user-facing metrics such as TTFT (time to first token), ITL (inter-token latency), and tail latency together with system metrics such as throughput, GPU utilization, and KV occupancy. On top of that, it must satisfy operational constraints such as fairness, tenant priority, and an upper bound on active sequences.

In other words, the scheduler does not only decide "who runs first." It solves a control problem that also covers when to admit a request, which batch to place it in, and in what ratio to mix prefill and decode.

Figure 1. The scheduler objective, which has to balance latency, efficiency, policy, and system reality at once

2. In practice the scheduler repeats ingest -> admit -> batch -> refill

A request enters the queue at ingress carrying its tenant, model, and SLO information, and the admission gate looks at the KV budget and the active sequence budget to decide whether to accept the request or delay it. The runtime then forms a prefill or decode batch and refills the empty slots each time a step finishes.

The important point is that this loop repeats very frequently, per token step rather than per request. Small policy differences can therefore accumulate into large effects on TTFT, ITL, and even tail latency.
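The loop described above can be sketched as a toy simulation. This is a minimal illustration under stated assumptions, not the loop of any particular engine: `Request`, `run_loop`, `max_active`, and `kv_budget_tokens` are hypothetical names, prefill is treated as free, every active sequence produces exactly one token per step, and admission reserves the worst-case KV footprint (prompt plus maximum new tokens).

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    prompt_tokens: int
    max_new_tokens: int
    generated: int = 0


def run_loop(requests, max_active, kv_budget_tokens):
    """Toy ingest -> admit -> batch -> refill loop.

    Returns a dict mapping request id to the step at which it finished.
    """
    waiting = deque(requests)  # ingest: FCFS waiting queue
    active = []
    finished_at = {}
    step = 0
    while waiting or active:
        # Admit / refill: fill free slots while the worst-case KV footprint fits.
        used = sum(r.prompt_tokens + r.max_new_tokens for r in active)
        while waiting and len(active) < max_active:
            need = waiting[0].prompt_tokens + waiting[0].max_new_tokens
            if used + need > kv_budget_tokens:
                break  # head-of-line request must wait for memory
            active.append(waiting.popleft())
            used += need
        if not active:
            raise RuntimeError("head request can never fit in the KV budget")
        # Batch: one decode step for every active sequence.
        step += 1
        for r in active:
            r.generated += 1
        # Retire finished sequences; their slots are refilled on the next pass.
        for r in active:
            if r.generated >= r.max_new_tokens:
                finished_at[r.rid] = step
        active = [r for r in active if r.generated < r.max_new_tokens]
    return finished_at


reqs = [Request("a", 4, 2), Request("b", 4, 4), Request("c", 4, 1)]
print(run_loop(reqs, max_active=2, kv_budget_tokens=100))
```

With two slots, request `c` is admitted at the step right after `a` retires rather than after the whole batch drains, which is the slot-refill behavior the section describes.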

Figure 2. The inference scheduler loop: ingest, admit, batch, refill

Admission is closer to a memory problem than a compute problem

In modern LLM serving, whether a request is admitted depends more strongly on the state of the KV block pool and the prefix cache than on a simple count of GPU slots. The scheduler therefore cannot be considered separately from the executor and the memory manager.
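A memory-oriented admission gate can be sketched as follows. This is a hedged illustration rather than the logic of a specific runtime: `PoolState`, `blocks_needed`, `admit`, the 16-token block size, and the `watermark_blocks` reserve are all assumptions, and prefix-cache hits are modeled simply as tokens that need no new blocks.

```python
import math
from dataclasses import dataclass


@dataclass
class PoolState:
    free_blocks: int   # unallocated KV blocks in the pool
    active_seqs: int   # sequences currently decoding


def blocks_needed(prompt_tokens, reserve_new_tokens,
                  cached_prefix_tokens=0, block_size=16):
    """KV blocks to allocate, assuming cached prefix blocks are shared."""
    uncached = max(prompt_tokens - cached_prefix_tokens, 0)
    return math.ceil((uncached + reserve_new_tokens) / block_size)


def admit(state, prompt_tokens, reserve_new_tokens, cached_prefix_tokens=0,
          max_active_seqs=64, block_size=16, watermark_blocks=0):
    """Admit only if both the sequence budget and the KV budget allow it."""
    if state.active_seqs >= max_active_seqs:
        return False
    need = blocks_needed(prompt_tokens, reserve_new_tokens,
                         cached_prefix_tokens, block_size)
    # Keep a small reserve so running decodes can still grow.
    return need + watermark_blocks <= state.free_blocks


pool = PoolState(free_blocks=6, active_seqs=10)
print(admit(pool, prompt_tokens=100, reserve_new_tokens=28))
print(admit(pool, prompt_tokens=100, reserve_new_tokens=28,
            cached_prefix_tokens=64))
```

The same request is held when it needs 8 fresh blocks but admitted when a 64-token prefix is already cached and only 4 blocks are needed, which is why prefix cache state matters to admission.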

3. Representative policies are a choice among simplicity, efficiency, and priority control

Static batching and FCFS are simple to implement and predictable, but when requests of different lengths are mixed, gaps in GPU and memory usage appear frequently. Continuous batching, by contrast, raises utilization but requires fairness to be managed separately and is more sensitive to changes in queue state.

Priority-based or SLO-aware policies are good at rescuing urgent requests, but if designed poorly they can leave low-priority requests waiting for a long time. For that reason many systems use a compromise that blends FCFS, priority, aging, and deadline-like elements.
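One common way to blend priority with aging is to let waiting time earn credit against the base priority. The sketch below is an assumption-laden example, not a rule taken from any named system: `Queued`, `effective_priority`, `pick_next`, and the `aging_rate` value are hypothetical, and lower numbers mean more urgent.

```python
from dataclasses import dataclass


@dataclass
class Queued:
    rid: str
    priority: int     # lower value = more urgent
    arrival_s: float  # arrival time in seconds


def effective_priority(req, now_s, aging_rate=0.1):
    """Base priority minus credit earned for waiting; lower wins."""
    return req.priority - aging_rate * (now_s - req.arrival_s)


def pick_next(queue, now_s, aging_rate=0.1):
    """Choose the request with the lowest effective priority, FCFS on ties."""
    return min(
        queue,
        key=lambda r: (effective_priority(r, now_s, aging_rate), r.arrival_s),
    )


low = Queued("low", priority=2, arrival_s=0.0)
print(pick_next([low, Queued("high", 0, 9.0)], now_s=10.0).rid)
print(pick_next([low, Queued("high2", 0, 29.0)], now_s=30.0).rid)
```

After 10 seconds the urgent request still wins, but after 30 seconds the low-priority request has aged past a freshly arrived urgent one, which bounds how long it can starve.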

Comparison / analysis

| Policy | Best-fit situation | Strength | Weakness |
| --- | --- | --- | --- |
| FCFS / static batch | Request lengths are similar and load is low | Simple to implement and predictable | GPU gaps appear easily and utilization is low |
| Continuous batching | Mixed workloads, long queues, decode-centric services | Raises utilization through slot refill | Fairness and queue stability must be managed separately |
| Chunked prefill + priority | Services that mix long prefills with short decodes | Handles TTFT and ITL together well | Sensitive to the combination of chunk size and priority |
| SLO-aware priority / aging | Multi-tenant services, deadline-sensitive requests | Makes it easy to protect urgent requests | Risk of low-priority starvation |
| Cluster-level routing | Multi-GPU / multi-node serving | Considers locality and load balance together | Routing and state management are complex |

Figure 3. Comparison of static batch, continuous batching, and priority-aware policy
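To make the "chunked prefill + priority" row concrete, here is a minimal per-step token budgeting sketch. It is loosely modeled on the general chunked prefill idea and is not the algorithm of any specific engine: `plan_step`, `token_budget`, and the decode-first ordering are assumptions. Each decoding sequence costs one token, and whatever budget remains is spent on prefill chunks in queue order.

```python
def plan_step(decode_ids, prefill_remaining, token_budget):
    """Plan one step under a fixed token budget.

    decode_ids: ids of sequences that each need 1 token this step.
    prefill_remaining: dict of id -> prompt tokens not yet prefilled,
        in queue order (dicts preserve insertion order).
    Returns (scheduled_decode_ids, {prefill_id: chunk_tokens}).
    """
    # Decode first, so ongoing generations keep a steady ITL.
    scheduled_decodes = decode_ids[:token_budget]
    budget = token_budget - len(scheduled_decodes)
    chunks = {}
    for rid, remaining in prefill_remaining.items():
        if budget == 0:
            break
        if remaining <= 0:
            continue
        take = min(remaining, budget)  # split long prompts into chunks
        chunks[rid] = take
        budget -= take
    return scheduled_decodes, chunks


decodes = ["d1", "d2", "d3"]
print(plan_step(decodes, {"p1": 700, "p2": 50}, token_budget=512))
print(plan_step(decodes, {"p1": 191, "p2": 50}, token_budget=512))
```

A 700-token prompt is spread over two steps instead of stalling the three decodes for one long prefill. A smaller `token_budget` favors ITL and a larger one favors TTFT, which is the sensitivity noted in the table.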

4. Queueing and backpressure determine operational quality

When active decodes consume nearly the entire block pool, new admissions are blocked, and when prefill bursts are excessive, decode ITL becomes unstable. If remote KV or network delays are added on top, backpressure propagates all the way to the front-end ingress.

A good scheduler does not simply drain the queue. It detects early where the bottleneck begins and applies buffering strategies such as hold, reject, reroute, and chunk resize. When this layer is weak, the GPU may look busy while user experience degrades sharply.
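The buffering strategies named above can be expressed as a small decision function. This is only a sketch: `Action`, `backpressure_action`, and every threshold (`kv_high`, `queue_max`, `itl_slo_ms`) are hypothetical values chosen for illustration, and a real system would likely add hysteresis so the decision does not flap.

```python
from enum import Enum


class Action(Enum):
    ADMIT = "admit"
    HOLD = "hold"
    REJECT = "reject"
    REROUTE = "reroute"
    SHRINK_CHUNK = "shrink_chunk"


def backpressure_action(kv_util, queue_depth, itl_ms, *, itl_slo_ms=50.0,
                        kv_high=0.9, queue_max=256, peer_has_room=False):
    """Map observed pressure signals to one buffering action."""
    if queue_depth >= queue_max:
        # Ingress queue is saturated: shed load or send it elsewhere.
        return Action.REROUTE if peer_has_room else Action.REJECT
    if kv_util >= kv_high:
        # Block pool is nearly full: do not admit new sequences here.
        return Action.REROUTE if peer_has_room else Action.HOLD
    if itl_ms > itl_slo_ms:
        # Prefill is hurting decode latency: use smaller prefill chunks.
        return Action.SHRINK_CHUNK
    return Action.ADMIT


print(backpressure_action(kv_util=0.95, queue_depth=10, itl_ms=20.0))
print(backpressure_action(kv_util=0.50, queue_depth=10, itl_ms=80.0))
```

The ordering of the checks encodes a design choice: queue saturation and memory pressure are treated as harder limits than an ITL violation, which can often be relieved by resizing chunks alone.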

Figure 4. Backpressure propagation among the ingress queue, admission gate, active decode pool, and prefill pool

5. Scheduling extends from a single node to the cluster

In multi-GPU or multi-node serving, a front router distributes requests based on model, tenant, and affinity, a global scheduler decides placement and policy, and each node runtime then performs local continuous batching. Global policy and local policy are thus split hierarchically.

In architectures where the network and remote state matter, such as disaggregated serving or MoE serving, the global scheduler has to consider node congestion, KV locality, and fabric state together. At that point the scheduler effectively acts as a distributed-systems controller.
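A global placement decision that weighs congestion against KV locality might look like the sketch below. It is an illustrative cost function under assumptions, not a description of any real router: `Node`, `route`, and the weights `w_load`, `w_queue`, and `w_locality` are hypothetical, and fabric state is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    kv_util: float             # fraction of the KV block pool in use
    queue_depth: int           # requests waiting on this node
    cached_prefix_tokens: int  # prompt tokens already cached on this node


def route(nodes, prompt_tokens, w_load=1.0, w_queue=0.01, w_locality=1.0):
    """Pick the node with the lowest cost: congestion minus locality benefit."""
    def cost(n):
        hit = min(n.cached_prefix_tokens, prompt_tokens) / max(prompt_tokens, 1)
        return w_load * n.kv_util + w_queue * n.queue_depth - w_locality * hit
    return min(nodes, key=cost)


idle = Node("idle", kv_util=0.5, queue_depth=10, cached_prefix_tokens=0)
warm = Node("warm", kv_util=0.8, queue_depth=20, cached_prefix_tokens=900)
hot = Node("hot", kv_util=0.95, queue_depth=100, cached_prefix_tokens=200)
print(route([idle, warm], prompt_tokens=1000).name)
print(route([idle, hot], prompt_tokens=1000).name)
```

A busier node still wins when it holds most of the prompt's prefix, but a heavily congested node with a small cache hit loses to an idle one. The weights decide where that crossover sits.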

Figure 5. The cluster-level scheduler stack: front router, global scheduler, node runtime, execution resource

6. Pros and cons

The advantage is that a scheduler can use resources far more precisely than a simple batcher. By considering continuous batching, chunked prefill, the prefix cache, and the KV budget together, it can aim for higher utilization and lower tail latency on the same GPU. In multi-tenant environments, priority, aging, and quotas also make it easier to protect specific customers or service tiers.

The disadvantage is that policy complexity grows quickly. Favoring decode pushes long prefills back, aggressive priority causes starvation, and treating memory pressure conservatively lowers throughput. In distributed environments the state of the router, node runtime, and remote KV store also becomes entangled, so root-cause analysis and tuning get harder.

Related technologies

| Resource | Link | Connection |
| --- | --- | --- |
| Continuous Batching Analysis | llm_0065_continuous_batching_analysis.html | Token-level slot refill and chunked prefill |
| Disaggregated LLM Serving Analysis | llm_0080_disaggregated_llm_serving_analysis.html | Prefill/decode separation and KV handoff |
| Prefix Caching Analysis | llm_0045_prefix_caching_analysis.html | Shared prefix reuse and its interaction with admission |
| Speculative Decoding Analysis | llm_0070_speculative_decoding_analysis.html | How the draft/verify stages affect the scheduler |
| vLLM | https://github.com/vllm-project/vllm | An implementation of continuous batching and KV management |
| TensorRT-LLM | https://nvidia.github.io/TensorRT-LLM/ | Reference for serving runtime and batching optimization |
| Sarathi-Serve | https://arxiv.org/abs/2403.02310 | Chunked prefill and stall-free batching |

7. Key takeaways

The essence of an LLM inference scheduler is not building tidy batches. It is an online control problem of managing latency, throughput, and fairness simultaneously within limited memory and compute. Good policies are therefore often the rules that reflect actual block pool and queue dynamics well, rather than rules that are clean only in theory.

In the end the scheduler is the core of the serving engine, not a peripheral feature, because continuous batching, prefix caching, remote KV, and multi-tenant policy all reach realistic quality only through the scheduler.