Ryotta's Basic

vAttention Analysis

vAttention: An In-Depth Analysis

CUDA Virtual Memory · Contiguous KV Cache · Demand Paging · ASPLOS 2025

vAttention (Prabhu et al., ASPLOS 2025, arXiv 2405.04437) manages KV cache memory dynamically by keeping the cache in contiguous virtual memory and mapping physical memory only when it is needed. Because it does not split the KV cache into blocks the way PagedAttention does, it can reduce physical fragmentation while using existing attention kernels unchanged.

The key insight is that fragmentation is fundamentally a physical-memory problem. Separating virtual address reservation from physical allocation with the CUDA virtual memory management (VMM) API lets GPU serving use the same structure as demand paging in an operating system. The paper reports up to 1.99x decode throughput, up to 1.23x offline end-to-end throughput, and out-of-the-box support for kernels as recent as FlashAttention-3.

1. Motivation

PagedAttention is effective at reducing internal fragmentation, but it pays for this by making the KV cache non-contiguous in virtual address space as well. As a result, the attention kernel has to follow a block table to reach each token, and the serving framework has to carry its own memory-management logic.

Figure 1. PagedAttention's non-contiguous layout versus vAttention's virtual contiguity

Costs of PagedAttention

  • Kernels must be rewritten: the attention kernel has to be reimplemented to handle a non-contiguous KV cache.
  • Block tables must be managed: the framework has to track the addresses and mappings of dynamically allocated blocks.
  • Higher CPU/GPU overhead: building the block table and following its indirection sit on the critical path.
  • Reduced portability: every new attention kernel needs a paged version written for it again.
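The indirection behind these costs can be pictured with a toy model. The sketch below is only an illustration in plain Python: `BLOCK_SIZE`, the block table contents, and both helper functions are hypothetical and do not come from vLLM or the paper. It contrasts resolving a token through a block table with the plain indexing that an unmodified kernel assumes.

```python
# Toy model: token lookup in a paged KV cache vs. a virtually contiguous one.
# BLOCK_SIZE and the block table contents are illustrative assumptions.

BLOCK_SIZE = 16  # tokens per KV block (hypothetical value)


def paged_lookup(pool, block_table, token_idx):
    """Resolve a token through the block table, as a paged kernel must."""
    block_id = block_table[token_idx // BLOCK_SIZE]  # extra indirection
    offset = token_idx % BLOCK_SIZE
    return pool[block_id][offset]


def contiguous_lookup(kv, token_idx):
    """Plain indexing: what an unmodified attention kernel assumes."""
    return kv[token_idx]


# One request with 40 tokens; its blocks are scattered across the pool.
tokens = [f"kv{t}" for t in range(40)]
block_table = [7, 2, 5]  # logical block -> physical block
pool = [[None] * BLOCK_SIZE for _ in range(8)]
for t, value in enumerate(tokens):
    pool[block_table[t // BLOCK_SIZE]][t % BLOCK_SIZE] = value

# Both views return the same data; only the paged one needs the table.
assert all(
    paged_lookup(pool, block_table, t) == contiguous_lookup(tokens, t)
    for t in range(40)
)
```

Every kernel that wants the paged layout has to embed the first lookup, which is why each new attention kernel needs a separate paged port.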

Key concepts

| Concept | Meaning |
| --- | --- |
| Virtual contiguity | The kernel sees a single contiguous tensor |
| Physical on-demand | Physical pages are mapped only as the number of tokens actually grows |
| page-group | A bundle of physical pages handled together in one VMM call |
| granularity | CUDA defaults to 2MB pages; the paper adds 64KB support |
| request-level buffer | K/V buffers are kept separate per request |

2. Separating virtual and physical memory

vAttention uses the CUDA VMM API to decouple virtual address allocation from physical memory allocation. For each request's KV tensors it reserves contiguous virtual addresses up front for the maximum sequence length, and attaches physical memory only at the point it is actually needed.

Figure 2. Contiguous virtual address reservation and on-demand physical mapping

| Step | Action | Related API |
| --- | --- | --- |
| 1 | Reserve a contiguous virtual address range for the KV cache | cuMemAddressReserve |
| 2 | Create a physical memory handle of the required size | cuMemCreate |
| 3 | Map physical memory into the reserved virtual range | cuMemMap |
| 4 | Grant access permissions so the range can be used like a tensor | cuMemSetAccess |
| 5 | Release, or defer reclamation, when the request finishes | cuMemUnmap, cuMemRelease |

The CUDA documentation also states that cuMemMap does not grant access permissions and that a separate cuMemSetAccess call is required. vAttention applies this flow as-is to each request's KV cache.
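The ordering of the five steps can be illustrated with a CPU-only toy model. The `ToyVMM` class and its methods are hypothetical names made up for this sketch. They only mimic the sequence of the driver calls named above. They are not CUDA bindings and allocate no GPU memory. The point shown is that a page that is mapped but has no access permission still cannot be touched.

```python
# Toy, CPU-only model of the five-step lifecycle. It mimics the ORDER of the
# CUDA driver calls; it is not a binding and allocates no GPU memory.

class ToyVMM:
    def __init__(self, page_size):
        self.page_size = page_size
        self.reserved = 0        # bytes of virtual address space reserved
        self.mapped = {}         # page index -> physical handle
        self.accessible = set()  # page indices with access enabled
        self._next_handle = 0

    def reserve(self, size):     # stands in for cuMemAddressReserve
        self.reserved = size

    def create(self):            # stands in for cuMemCreate
        self._next_handle += 1
        return self._next_handle

    def map(self, page, handle):  # stands in for cuMemMap
        if (page + 1) * self.page_size > self.reserved:
            raise ValueError("page lies outside the reserved range")
        self.mapped[page] = handle

    def set_access(self, page):  # stands in for cuMemSetAccess
        if page not in self.mapped:
            raise ValueError("page is not mapped")
        self.accessible.add(page)

    def touch(self, offset):     # stands in for a kernel load/store
        page = offset // self.page_size
        if page not in self.accessible:
            raise PermissionError(f"page {page} is not accessible")
        return page

    def unmap(self, page):       # stands in for cuMemUnmap + cuMemRelease
        self.accessible.discard(page)
        return self.mapped.pop(page)


vmm = ToyVMM(page_size=2 * 1024 * 1024)
vmm.reserve(64 * 1024 * 1024)    # step 1: virtual range only, no memory yet
handle = vmm.create()            # step 2
vmm.map(0, handle)               # step 3: mapped, but not yet accessible
try:
    vmm.touch(0)
    touched_before_access = True
except PermissionError:
    touched_before_access = False
vmm.set_access(0)                # step 4
first_page = vmm.touch(4096)     # the access now succeeds
released = vmm.unmap(0)          # step 5
```

A real implementation would call the driver API from C or C++ and check every return code; the model above only captures the state transitions.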

How it works

  • Virtual reservation: a 64-bit address space is plentiful, so reserving virtual addresses (VA) up front for each request's maximum length costs little. The paper assumes that user space on a 64-bit system is large enough.
  • Physical mapping: a new page is attached only when the token count grows and the existing pages are full.
  • Kernel view: the attention kernel runs unchanged, assuming the KV cache is contiguous.
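How rarely a new page has to be attached during decoding can be sketched with simple arithmetic. The numbers below are assumptions chosen for illustration (1 KiB per token in a single buffer, 64 KiB pages), and the function names are hypothetical.

```python
def pages_for(tokens, bytes_per_token, page_size):
    """Pages needed to hold `tokens` tokens (ceiling division)."""
    return (tokens * bytes_per_token + page_size - 1) // page_size


def decode_mapping_events(prompt_len, new_tokens, bytes_per_token, page_size):
    """Return the token counts at which a new physical page must be mapped."""
    mapped = pages_for(prompt_len, bytes_per_token, page_size)
    events = []
    for n in range(prompt_len + 1, prompt_len + new_tokens + 1):
        needed = pages_for(n, bytes_per_token, page_size)
        if needed > mapped:   # existing pages are full: attach one more
            events.append(n)
            mapped = needed
    return events


# Assumed sizes: 1 KiB per token in one buffer, 64 KiB pages (64 tokens/page).
events = decode_mapping_events(
    prompt_len=100, new_tokens=100,
    bytes_per_token=1024, page_size=64 * 1024,
)
print(events)  # [129, 193]
```

Under these assumptions only 2 of the 100 decode steps need a new mapping; every other step writes into pages that are already attached.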

3. LLM-specific optimizations

CUDA VMM is useful, but using it as-is for LLM serving raises two problems. The first is the latency of the runtime calls; the second is the internal fragmentation caused by large, 2MB-centric pages.
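The fragmentation problem can be made concrete with a back-of-the-envelope calculation. The sketch assumes one buffer per K and V tensor per layer (64 buffers for a 32-layer model) and 1 KiB per token per buffer. These figures are illustrative assumptions, not numbers taken from the text above.

```python
MIB = 1024 * 1024
KIB = 1024


def wasted_bytes(tokens, bytes_per_token, page_size):
    """Mapped-but-unused bytes at the tail of one buffer."""
    used = tokens * bytes_per_token
    mapped = -(-used // page_size) * page_size  # round up to whole pages
    return mapped - used


BYTES_PER_TOKEN = 1 * KIB  # assumed per-buffer footprint of one token
NUM_BUFFERS = 64           # assumed: one K and one V buffer for 32 layers
TOKENS = 100               # a short request

waste_2mb = NUM_BUFFERS * wasted_bytes(TOKENS, BYTES_PER_TOKEN, 2 * MIB)
waste_64kb = NUM_BUFFERS * wasted_bytes(TOKENS, BYTES_PER_TOKEN, 64 * KIB)
print(f"2MB pages : {waste_2mb / MIB:.2f} MiB wasted")
print(f"64KB pages: {waste_64kb / MIB:.2f} MiB wasted")
```

With these assumptions a 100-token request strands roughly 122 MiB under 2MB pages and under 2 MiB under 64KB pages, which is why short sequences suffer most from the larger granularity.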

Figure 3. Responses to 2MB granularity and VMM call latency

| Problem | Observation in the paper | Response |
| --- | --- | --- |
| Large page granularity | 2MB units waste a lot of memory on short sequences | 64KB page-group support |
| VMM call latency | Spikes of 5~15ms per call are possible | Overlap compute with allocation |
| Reuse when a request finishes | Memory that a new request can use immediately is preferable | deferred reclamation |
| Allocation on the critical path | Synchronous allocation during decode increases latency | eager / proactive allocation |
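Deferred reclamation and eager allocation can be pictured with a small bookkeeping model. The `RequestSlotPool` class below is a hypothetical sketch, not the paper's implementation: it only counts how many mapping calls would be issued, with `map_calls` standing in for the costly VMM calls.

```python
class RequestSlotPool:
    """Toy model: each slot is a virtual KV buffer that may keep pages mapped."""

    def __init__(self, num_slots):
        self.pages = [0] * num_slots       # physical pages mapped per slot
        self.free = list(range(num_slots))
        self.map_calls = 0                 # stands in for costly VMM calls

    def _ensure(self, slot, pages):
        while self.pages[slot] < pages:
            self.pages[slot] += 1
            self.map_calls += 1

    def acquire(self, pages_needed):
        # Deferred reclamation: prefer the free slot with the most pages
        # still mapped, so a new request can reuse them without new calls.
        slot = max(self.free, key=lambda s: self.pages[s])
        self.free.remove(slot)
        self._ensure(slot, pages_needed)
        return slot

    def release(self, slot):
        # Keep the pages mapped instead of unmapping them right away.
        self.free.append(slot)

    def prepare_next_step(self, slot, pages_needed_next):
        # Eager allocation: map ahead of the step that will need the page,
        # so the mapping does not land on the decode critical path.
        self._ensure(slot, pages_needed_next)


pool = RequestSlotPool(num_slots=2)
a = pool.acquire(pages_needed=3)   # 3 mapping calls for the first request
pool.release(a)                    # pages stay mapped
b = pool.acquire(pages_needed=2)   # reuses the same slot, 0 new calls
pool.prepare_next_step(b, 4)       # 1 call, issued before it is needed
print(pool.map_calls)              # 4
```

In a real system the eager step would run in the background while the GPU computes the current iteration; the model leaves out that concurrency and only shows the accounting.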

The paper reports per-token memory footprints of 64KB, 128KB, and 240KB for Yi-6B, Llama-3-8B, and Yi-34B respectively, and a peak overall allocation rate of around 750MB/s. In other words, memory allocation bandwidth is not the bottleneck in LLM inference; what matters is hiding latency and mitigating fragmentation.
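The three per-token footprints can be reproduced with the usual KV cache size formula, 2 x layers x KV heads x head dimension x bytes per element. The layer counts, KV head counts, and head dimension below are assumptions taken from the models' public configurations, and FP16 storage (2 bytes) is also assumed. None of these values are stated in the text above.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers."""
    # The factor 2 covers the K and V tensors.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


# Assumed model shapes (from public configs, not from the text above).
MODELS = {
    "Yi-6B": dict(layers=32, kv_heads=4, head_dim=128),
    "Llama-3-8B": dict(layers=32, kv_heads=8, head_dim=128),
    "Yi-34B": dict(layers=60, kv_heads=8, head_dim=128),
}

footprint_kib = {
    name: kv_bytes_per_token(**cfg) // 1024 for name, cfg in MODELS.items()
}
for name, kib in footprint_kib.items():
    print(f"{name}: {kib} KiB per token")
```

Under these assumptions the formula yields 64, 128, and 240 KiB, matching the figures quoted above.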

4. Performance and comparison

Figure 4. Reported throughput and comparison with PagedAttention

| Metric | Result | Baseline |
| --- | --- | --- |
| decode throughput | up to 1.99x | vLLM |
| prefill throughput | up to 1.24x / 1.26x / 1.24x | FA2 Paged, 192K context |
| prefill throughput | up to 1.25x / 1.36x / 1.17x | FlashInfer Paged |
| offline end-to-end | up to 1.18x / 1.15x / 1.13x | FA2 Paged |
| offline end-to-end | up to 1.19x / 1.23x / 1.14x | FlashInfer Paged |
| online median latency | up to 42% lower | FA2 Paged |

PagedAttention vs vAttention

| Aspect | PagedAttention | vAttention |
| --- | --- | --- |
| Virtual KV layout | Non-contiguous | Contiguous |
| attention kernel | Must be rewritten for paging | Existing kernels used as-is |
| Who manages memory | Framework-centric | CUDA VMM / driver-centric |
| CPU/GPU overhead | High | Low |
| Portability | Low | High |
| Dependency | Generic block abstraction | NVIDIA CUDA VMM |

vAttention makes kernels that had no PagedAttention support, such as FlashAttention-3, usable right away. In other words, its aim is to keep the memory-management scheme from becoming an obstacle to keeping pace with kernel innovation.

5. Pros and cons

  • Pro: almost no kernel changes are needed, so the latest attention kernels are easy to use as-is.
  • Pro: block table management disappears, which simplifies the CPU/GPU path.
  • Pro: only physical memory is attached on demand, which reduces fragmentation and waste.
  • Pro: it combines independently with techniques that shrink the KV cache (quantization, GQA, MLA).
  • Limitation: it depends on CUDA VMM and the NVIDIA driver.
  • Limitation: supporting small pages may require a driver extension.
  • Limitation: it reduces physical fragmentation but does not shrink the KV cache itself, so for large models it is best combined with quantization.

6. Related technologies

| Document / technology | Connection |
| --- | --- |
| vAttention (ASPLOS 2025, arXiv 2405.04437) | Primary source for the CUDA VMM-based design, performance numbers, and optimization techniques |
| PagedAttention Analysis | Baseline for contrast: block-level KV management |
| CUDA Driver API Virtual Memory Management | cuMemAddressReserve, cuMemCreate, cuMemMap, cuMemSetAccess |
| Unified Memory (cudaMallocManaged) | Alternative compared in Section 8 of the paper |
| FlashAttention-2 / FlashAttention-3 | Attention kernel family that vAttention uses as-is |
| FlashInfer | Baseline with paged / non-paged kernels |
| vLLM / TensorRT-LLM | Representative PagedAttention-based serving systems |

7. Key takeaways

vAttention reaches the same goal as PagedAttention, mitigating KV cache fragmentation, in a simpler way. Where PagedAttention made even the virtual layout of the KV cache non-contiguous, vAttention uses CUDA VMM to keep virtual addresses contiguous and maps physical memory only when needed. As a result, the latest attention kernels can be used without modification, and the overhead of block tables and indirection is reduced.

The paper compensates for the 2MB default granularity and VMM call latency with 64KB page-groups, overlap, deferred reclamation, and eager allocation. Reported gains are up to 1.99x decode throughput and up to 1.23x offline end-to-end throughput, and latency also drops in the online setting. The price is a dependency on CUDA VMM; in return, the approach is orthogonal to KV quantization and GQA/MLA and can be used alongside them.