Memory Centric LLM Serving Survey

LLM serving optimization from a memory-system perspective: a paper roundup

Memory-Centric LLM Serving: KV compression, caching, offloading, P/D disaggregation, PIM/PNM, and reliability (a synthesis of recent research)

Large language model (LLM) inference is inherently memory-bound. Its key bottlenecks are the explosive growth of KV cache capacity and the memory bandwidth consumed during decoding. This document groups recent work that optimizes LLM serving from a memory-system perspective into four strands: (1) making KV smaller (compression, eviction), (2) reusing KV (caching, paging), (3) deciding where KV lives (offloading, tiering), and (4) deciding where to compute (P/D disaggregation, PIM/PNM). The cross-cutting axes of precision, reliability, and scheduling are covered separately. Each category collects representative papers in a table that summarizes name, venue, and key contribution. The numbers and contributions in the tables summarize what each paper reports; check the original papers for the exact conditions.

This taxonomy is more than a list of techniques. In a real serving design, quantization, paging, offloading, and disaggregated placement are applied together on the same request path, and each one ties directly to HBM capacity, PCIe/CXL bandwidth, and latency metrics such as TTFT and TPOT. The memory-centric view is therefore both a way to classify papers and a checklist for operating the system.

1. Background: why memory is the bottleneck


Figure 1. The memory wall of LLM serving (capacity and bandwidth), the heterogeneous memory hierarchy, and the four optimization questions

Decoding is a memory-bound operation: every generated token requires reading the entire KV cache and the weights. Two pressures act at once. The first is capacity: the KV cache grows in proportion to context length and batch size, so for a 70B model at 128K context and batch size 32 the KV cache alone runs well past 150 GB and exceeds HBM. The second is bandwidth: HBM is fast but small, and adding capacity means adding slower lower tiers. The heterogeneous memory hierarchy (HBM > DRAM > CXL > NVMe) trades small, fast storage against large, slow storage. Memory-centric optimization comes down to four questions: how to make KV smaller, how to reuse it, where to keep it, and where to compute.

2. Taxonomy


Figure 2. The four categories of memory-centric LLM serving optimization and the cross-cutting axes (reliability, precision, SLO)

The KV cache management survey (Li et al., TMLR 2025, arXiv 2412.19442) classifies optimizations as token-level, model-level, and system-level. Another review (arXiv 2603.20397) organizes them into five directions: cache eviction, compression, hybrid memory, novel attention, and combinations. This document regroups them into four memory-centric strands and treats precision, reliability, and scheduling as cross-cutting axes. Reliability in particular is underexplored compared with performance-oriented work, so it gets its own section (Section 8).

3. Making KV smaller: compression, eviction, architecture

Approaches that shrink the KV cache fall along four axes: tokens (evicting unimportant tokens), bits (quantization), heads and low-rank dimensions (architecture), and attention pattern (locality).

In practice the four axes overlap. Token eviction shortens the cached sequence by dropping low-importance tokens, quantization stores the same information in fewer bits, and architectural techniques change the model so that less KV is produced in the first place. Local attention caps the cache size without fully recomputing the long context.

Axis | Representative techniques | Core idea
Token eviction and selection | H2O, StreamingLLM, SnapKV, Scissorhands, Keyformer, BUZZ, Quest | Keep only the tokens worth retaining, using attention score, recency, query relevance, and persistence of importance
Quantization | KIVI, KVQuant, Oaken | Store keys and values with asymmetric, non-uniform, or mixed precision to cut memory and bandwidth
Heads, low rank, attention | GQA/MQA, MLA, Sliding Window Attention, Gemma 2 interleave | Reduce the number of heads or the latent dimension, or limit KV creation with local attention

3.1 Token eviction and selection

Token eviction does not keep the whole KV cache; it discards entries that matter less for predicting the next token. StreamingLLM keeps the attention sinks and a recent window, while H2O keeps the heavy hitters with the highest accumulated attention. SnapKV picks important positions from an observation window at the end of prefill, and Scissorhands assumes that important tokens stay important. Quest uses query-aware page loading to narrow what is fetched at each step while retaining the full cache, and Ada-KV and NACL address budget allocation and fairness of eviction together.

3.2 ์–‘์žํ™” (Quantization)

์–‘์žํ™”๋Š” KV๋ฅผ ๋” ๋‚ฎ์€ ๋น„ํŠธ์ˆ˜๋กœ ํ‘œํ˜„ํ•ด ์šฉ๋Ÿ‰๊ณผ ์ „์†ก๋Ÿ‰์„ ๋™์‹œ์— ์ค„์ž…๋‹ˆ๋‹ค. KIVI๋Š” Key์™€ Value์— ์„œ๋กœ ๋‹ค๋ฅธ ์ถ•์˜ ์–‘์žํ™”๋ฅผ ์ ์šฉํ•˜๊ณ , KVQuant๋Š” ๋น„๊ท ์ผ ์–‘์žํ™”์™€ per-channel ์„ค๊ณ„๋ฅผ ๊ฒฐํ•ฉํ•ด ๊ธด ๋ฌธ๋งฅ์—์„œ๋„ ์ •ํ™•๋„๋ฅผ ์ง€ํ‚ต๋‹ˆ๋‹ค. Oaken์€ ์˜จ๋ผ์ธยท์˜คํ”„๋ผ์ธ์„ ์„ž์€ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์ „๋žต์œผ๋กœ ์„œ๋น™ ์„ฑ๋Šฅ๊ณผ ํšจ์œจ์„ ํ•จ๊ป˜ ๋…ธ๋ฆฝ๋‹ˆ๋‹ค.

3.3 Heads, low rank, architecture

Architectural techniques reduce how much KV is produced, or make the model attend only to the range it needs rather than the whole context. GQA and MQA shrink the cache by reducing the number of KV heads, and MLA absorbs KV into a low-dimensional latent. Sliding Window Attention is a local attention that looks only at recent tokens and so bounds the dependence on sequence length, while the Gemma 2 interleave scheme mixes local and global layers to balance efficiency against long-range quality.

4. Reusing KV: paging and caching

The goal is to avoid rebuilding KV that has already been computed, by managing memory efficiently (paging) and by reusing common parts (caching).

Paging approaches reduce fragmentation, and caching approaches reuse requests that share a prefix or a meaning. Both serve the same end, but the former cuts waste in memory layout while the latter avoids the duplicate computation itself.

๋ฐฉ์‹ ์žฌ์‚ฌ์šฉ ๋Œ€์ƒ ๊ฐ•์  ์ฃผ์˜์ 
PagedAttention / vLLM KV ๋ธ”๋ก near-zero waste, ๋†’์€ batch ํšจ์œจ ๋ธ”๋ก ๊ฐ„ ๊ฐ„์ ‘ ์ฐธ์กฐ ๋น„์šฉ์ด ์žˆ์Œ
vAttention ์—ฐ์† ๊ฐ€์ƒ ์ฃผ์†Œ + ๋ฌผ๋ฆฌ ๋ฉ”๋ชจ๋ฆฌ CUDA VMM์œผ๋กœ ๋ฒ”์šฉ kernel ์ง€์› VMM ์ง€์›๊ณผ ๊ตฌํ˜„ ๋ณต์žก๋„ ์˜์กด
RadixAttention / APC prefix tree ๋ธ”๋ก ๋ฉ€ํ‹ฐํ„ด ํ”„๋กฌํ”„ํŠธ ๊ณต์œ ์— ๊ฐ•ํ•จ ์ •ํ™•ํ•œ prefix ์ผ์น˜๊ฐ€ ํ•„์š”
Prompt caching provider prompt API ์ˆ˜์ค€์—์„œ ๊ฐ„๋‹จํžˆ ์ ์šฉ ์ •์  ํ”„๋กฌํ”„ํŠธ์— ์ ํ•ฉ
Semantic caching ์˜๋ฏธ ์œ ์‚ฌ ์งˆ์˜/์‘๋‹ต ๋ฐ˜๋ณต ์งˆ์˜ ๋น„์šฉ์„ ํฌ๊ฒŒ ์ค„์ž„ false hit/miss ์ œ์–ด๊ฐ€ ํ•ต์‹ฌ
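Block-level prefix reuse can be illustrated with chained block hashes: each block's hash covers its own tokens plus everything before it, so an equal hash implies an equal prefix. This is a simplified sketch of the general idea, not the code of vLLM's Automatic Prefix Caching or SGLang's RadixAttention; the block size, class name, and method names are hypothetical, and eviction (LRU, reference counts) is left out.

```python
import hashlib

BLOCK = 4  # tokens per KV block (assumed; real engines use larger blocks)

def block_hashes(tokens):
    """Hash every full block together with the hash of the preceding prefix."""
    hashes, prev = [], b""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        chunk = ",".join(map(str, tokens[i:i + BLOCK])).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        hashes.append(prev)
    return hashes

class PrefixCache:
    def __init__(self):
        self.blocks = {}  # chained hash -> physical block id

    def match_and_insert(self, tokens):
        """Return (number of reused leading blocks, block ids for the request)."""
        ids, reused, hit = [], 0, True
        for h in block_hashes(tokens):
            if hit and h in self.blocks:
                reused += 1
            else:
                hit = False
                self.blocks.setdefault(h, len(self.blocks))
            ids.append(self.blocks[h])
        return reused, ids

cache = PrefixCache()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
r1, ids1 = cache.match_and_insert(system_prompt + [9, 10, 11, 12])
r2, ids2 = cache.match_and_insert(system_prompt + [20, 21, 22, 23])
print(r1, r2)  # prints 0 2
```

The second request reuses the two blocks of the shared system prompt and allocates only one new block, which is why an exact prefix match is required: changing a single early token changes every later hash.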

MeanCache handles semantically similar queries on a per-user basis and stores the context chain alongside each entry, aiming to reduce misjudgments on contextual queries. This family is especially useful for search-style services and for workloads with many repeated queries.
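A semantic cache reduces to a similarity lookup with a threshold. The sketch below is a generic illustration, not the design of MeanCache or GPTCache: the two-dimensional embeddings are hand-made stand-ins for the output of an embedding model, and the class, its methods, and the 0.9 threshold are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class SemanticCache:
    def __init__(self, threshold):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def put(self, embedding, response):
        self.entries.append((embedding, response))

    def get(self, embedding):
        """Return the cached response of the closest entry, or None on a miss."""
        best = max(self.entries, key=lambda e: cosine(e[0], embedding), default=None)
        if best is not None and cosine(best[0], embedding) >= self.threshold:
            return best[1]
        return None

cache = SemanticCache(threshold=0.9)
cache.put([1.0, 0.0], "cached answer")
hit = cache.get([0.98, 0.2])   # near-duplicate query: similarity is about 0.98
miss = cache.get([0.0, 1.0])   # unrelated query: similarity is 0
```

The threshold is the knob behind the false hit and false miss trade-off: lowering it saves more LLM calls but returns more wrong answers.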

5. Where KV lives: offloading and tiering


Figure 3. Offloading and tiering, and the key techniques for easing the transfer bottleneck (PCIe)

KV (and weights) that do not fit on the GPU are moved down to CPU memory, CXL, or NVMe, and only what is predicted to be needed is loaded back. The common bottleneck is PCIe/CXL link bandwidth, and transfer volume is reduced through prediction, prefetching, compression, and streaming. CXL is not just slower memory; it is also used as pooled capacity shared by several servers.

Tier | Characteristics | Typical use
HBM | Fastest but small | Hot KV, active weights
DRAM | Larger and slower than HBM | Warm KV, host-side cache
CXL | Pooled capacity, mid-range bandwidth | Shared KV, tiering
NVMe / SSD | Largest capacity, highest latency | Cold KV, archival offload

FlexGen์€ ๋ ˆ์ด์–ด๋ณ„ placement๋ฅผ LP๋กœ ์ตœ์ ํ™”ํ•˜๊ณ , DeepSpeed-Inference๋Š” ๋ ˆ์ด์–ด ๋‹จ์œ„ ์˜คํ”„๋กœ๋”ฉ์œผ๋กœ GPU ๋ฉ”๋ชจ๋ฆฌ ์˜ˆ์‚ฐ์„ ์ค„์ž…๋‹ˆ๋‹ค. InfiniGen์€ ๋‹ค์Œ ๋ ˆ์ด์–ด์— ํ•„์š”ํ•œ KV๋งŒ rehearsal ๊ธฐ๋ฐ˜์œผ๋กœ ์˜ˆ์ธกํ•ด ๊ฐ€์ ธ์˜ค๊ณ , CacheGen๊ณผ LMCache๋Š” ์••์ถ•ยท๋‹ค์ธต ์ €์žฅ์œผ๋กœ ์ „์†ก๊ณผ ์žฌ์‚ฌ์šฉ์„ ํ•จ๊ป˜ ๋‹ค๋ฃน๋‹ˆ๋‹ค. Select-N๊ณผ Aqua๋Š” SLO์™€ ๋„คํŠธ์›Œํฌ ์ƒํ™ฉ์„ ํ•จ๊ป˜ ๊ณ ๋ คํ•ด ์–ด๋–ค ์š”์ฒญ์„ ๋‚ด๋ฆด์ง€ ์ •ํ•ฉ๋‹ˆ๋‹ค.

6. Where to compute: P/D disaggregation and PIM/PNM

6.1 Prefill-Decode disaggregation

Compute-intensive prefill and memory-intensive decode are separated onto different hardware, which removes interference between them and lets each be optimized on its own terms.

Phase | Bottleneck | Suitable placement
Prefill | Compute-bound GEMM | GPU / NPU, large batches
Decode | Memory-bound GEMV | PIM / PNM, high bandwidth

DistServe places prefill and decode on different GPUs so that TTFT and TPOT can be optimized separately, and Splitwise splits the prompt and token phases across different machines or heterogeneous GPUs. Mooncake treats the KVCache itself as a disaggregated pool and builds on sharing it. The point is not disaggregation for its own sake but giving each phase the resources that match its own bottleneck.
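The split between compute-bound prefill and memory-bound decode shows up directly in arithmetic intensity, the number of FLOPs per byte moved. The sketch below estimates it for one weight matrix multiply, ignoring caches and KV traffic; the matrix sizes are illustrative assumptions. As a rough guide, and also as an assumption here, current accelerators need on the order of tens to hundreds of FLOPs per byte before compute rather than memory becomes the limit.

```python
def arithmetic_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte moved for an (m x k) @ (k x n) matmul, ignoring caches."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

prefill = arithmetic_intensity(m=4096, n=8192, k=8192)  # many prompt tokens at once
decode = arithmetic_intensity(m=1, n=8192, k=8192)      # one new token (GEMV)

print(prefill)           # prints 2048.0
print(round(decode, 3))  # prints 1.0
```

With about one FLOP per byte, decode leaves compute units idle while waiting on memory, which is the case for running it where bandwidth is highest.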

6.2 PIM / PNM: computing next to memory


Figure 4. CXL and PIM/PNM (near-memory compute) with representative work, directly tied to Ryotta's research

The GEMV in decoding is memory-bound and therefore a good fit for bank-parallel PIM, while the GEMM in prefill is handled by an NPU or GPU; this heterogeneous acceleration is the core pattern. CXL provides the capacity and pooling tier, and PNM reduces the transfers themselves by computing next to memory.

AttAcc, NeuPIMs, IANUS, and SpecPIM pair DRAM-based PIM such as HBM-PIM with an NPU to ease the decoding bandwidth bottleneck, and LPDDR-CXL-PNM, CXL-NDP, Scalable CXL-PNM, and Sangam push more of the work into the CXL tier or chiplet DRAM. All of these ask the same question: "Can the work be finished near memory before the data is brought to the GPU?"

7. Strengths and weaknesses

The biggest strength of memory-centric optimization is that it directly targets the real bottlenecks of LLM serving: KV cache capacity and decoding bandwidth. Quantization, eviction, and architectural changes fit longer contexts and larger batches into the same HBM budget, and paging and caching raise throughput by cutting fragmentation and duplicate computation. Offloading, tiering, and P/D disaggregation widen the resource pool beyond HBM to DRAM, CXL, and SSD, which recasts the single-GPU memory limit as a system-level problem.

On the other hand, the more techniques are combined, the higher the implementation difficulty and operational risk. KV eviction and quantization can degrade accuracy or increase quality variance, and offloading and tiering bring PCIe/CXL bottlenecks and the cost of mispredictions. Disaggregated placement and PIM/PNM have large performance potential, but they depend heavily on the software stack because the runtime scheduler, memory manager, and kernel compatibility all have to line up.

8. Cross-cutting axis: reliability (an open area)


Figure 5. Error tolerance of the KV cache, reliability-aware design opportunities, and the overall summary

Compared with performance work, reliability is relatively unexplored. The key finding that recent studies report in common is that the KV cache is comparatively robust to bit errors, because the softmax operation dilutes (masks) them. Sensitivity still varies with token importance, bit position (MSB), and layer, which leaves room for differentiated protection.

Research opportunity: combining the KV cache's relative robustness to bit errors with importance-dependent sensitivity points to reliability-aware tiering and offloading, which decides at once which KV goes to which tier, at how many bits, and with what ECC strength. This is a natural gap. Scheduling that handles a performance SLO and an accuracy (error) SLO together has not yet been explored enough. It matters most for heterogeneous memory (CXL, PIM/PNM), because each tier has different bit-error characteristics.
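The dependence on bit position can be reproduced with a toy experiment. The sketch below flips one bit of a single FP16 key element and measures how much a one-query attention output moves. It is a hand-built illustration with made-up tensors, not the fault-injection methodology of the studies cited in this section; all names are hypothetical.

```python
import numpy as np

def flip_bit(x16, index, bit):
    """Return a copy of a float16 array with one bit of one element flipped."""
    raw = x16.copy().view(np.uint16)
    raw.flat[index] ^= np.uint16(1 << bit)
    return raw.view(np.float16)

def attention(q, keys, values):
    scores = keys.astype(np.float32) @ q / np.sqrt(q.size)
    weights = np.exp(scores - scores.max())  # subtract max for numerical stability
    return (weights / weights.sum()) @ values

q = np.ones(4, dtype=np.float32)
keys = np.full((4, 4), 0.5, dtype=np.float16)  # 4 tokens, 4 channels
values = np.eye(4, dtype=np.float32)           # output equals the attention weights

clean = attention(q, keys, values)
# Bit 0 is the mantissa LSB; bit 14 is the most significant exponent bit in FP16.
err_lsb = np.abs(attention(q, flip_bit(keys, 0, 0), values) - clean).max()
err_msb = np.abs(attention(q, flip_bit(keys, 0, 14), values) - clean).max()
```

In this setup the mantissa flip shifts the output by less than 1e-4, while the exponent flip turns 0.5 into 32768 and pulls nearly all attention onto one token, an output change of 0.75. That asymmetry is the argument for protecting high-order bits more strongly than low-order ones.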

Direction | Representative work | Implication
Refresh-aware | Kelle, SHIELD | Important and unimportant tokens are handled with different refresh policies
Storage-aware | KVNAND | Even with KV placed in NAND, softmax masking holds up to a degree
Precision-aware | FineServe | Slab allocation and scheduling can be matched to quantization characteristics
Fault study | GPU soft-error studies | Single-bit errors are often masked, but accumulated errors are dangerous

๊ด€๋ จ ์„œ๋ฒ ์ด์™€ ๊ธฐ์ค€ ๋ฌธํ—Œ

์ž๋ฃŒ ๊ด€์ 
A Survey on Large Language Model Acceleration based on KV Cache Management (arXiv 2412.19442) token-level, model-level, system-level taxonomy
A Survey on Efficient Inference for Large Language Models (arXiv 2404.14294) data/model/system optimization๊ณผ ๋ณ‘๋ชฉ ์›์ธ ์ •๋ฆฌ
Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective (arXiv 2410.04466) CPU/GPU/FPGA/ASIC/PIM/NDP ๊ด€์  ๋น„๊ต
PagedAttention Analysis page/block ๊ธฐ๋ฐ˜ KV ๊ด€๋ฆฌ
KV Cache Quantization Analysis KV ๋น„ํŠธํญ ์ถ•์†Œ์™€ ์ •ํ™•๋„ ์ ˆ์ถฉ
Disaggregated LLM Serving Analysis P/D ๋ถ„๋ฆฌ์™€ ํด๋Ÿฌ์Šคํ„ฐ ๋ฐฐ์น˜

9. Overall summary

From the memory perspective, LLM serving optimization converges as follows.

  • Making KV smaller: token eviction (H2O, StreamingLLM, SnapKV, Quest, Scissorhands), quantization (KIVI, KVQuant, Oaken), low rank (MLA, GQA), and local attention (SWA, Gemma interleave).

  • Reusing KV: paging (PagedAttention, vAttention), prefix caching (RadixAttention, APC, provider prompt caching), semantic caching (GPTCache), and heterogeneous paging (Jenga).

  • Where KV lives: offloading (FlexGen, DeepSpeed, PowerInfer, LMCache, CacheGen, InfiniGen) and SLO- and network-aware placement (Select-N, Aqua), with prediction and prefetching to ease the PCIe bottleneck.

  • Where to compute: P/D disaggregation (DistServe, Splitwise, Mooncake, Sarathi-Serve, TetriInfer), PIM/PNM (AttAcc, NeuPIMs, IANUS, SpecPIM), and CXL tiering and near-memory compute (LPDDR-CXL-PNM, Pond, CXL-NDP, Scalable CXL-PNM, Sangam).

  • Cross-cutting axes: precision (quantization), reliability (Kelle, SHIELD, KVNAND, soft-error studies), and scheduling/SLO. Reliability is relatively unexplored, so the research opportunity is large.

Key message: the memory wall of LLM serving is converging toward a structural resolution through a heterogeneous hierarchy of HBM (bandwidth), CXL (capacity and pooling), and near-memory compute (PIM/PNM). KV cache management (smaller, reused, placed) and compute placement (P/D disaggregation, near-memory compute) are orthogonal, so they are applied together. Across the precision, reliability, and SLO axes that cut through all of these strands, reliability-aware KV tiering and offloading stands out as the most natural gap left by performance-oriented work: formalizing, together with an accuracy SLO, which KV goes to which tier, at how many bits, and with what ECC.

Note: this summary is based on the papers and surveys cited (KV cache management survey arXiv 2412.19442, efficient inference survey arXiv 2404.14294, hardware-perspective survey arXiv 2410.04466, PagedAttention arXiv 2309.06180, DistServe arXiv 2401.09670, vAttention arXiv 2405.04437, InfiniGen arXiv 2406.19707, MeanCache arXiv 2403.02694, and the individual papers cited in the text). The venues, years, and numbers in the tables summarize reported values and may differ by version and conditions; some entries are arXiv-only preprints. The field moves quickly, so this list covers only some representative cases and is not complete. Check the original papers for exact numbers and methods.

Token eviction and selection (Section 3.1)
Technique | Venue | Key contribution
StreamingLLM | ICLR 2024 | Attention sinks (initial tokens) + recent window for unbounded streaming; 4M tokens, 22.2x (arXiv 2309.17453)
H2O | NeurIPS 2023 | Dynamic eviction that keeps only heavy hitters (top accumulated attention scores) + recent tokens (arXiv 2306.14048)
Scissorhands | NeurIPS 2023 | "Persistence of importance" hypothesis: important tokens stay important; test-time KV compression
SnapKV | NeurIPS 2024 | Selects important positions during prefill using an observation window at the end of the prompt (arXiv 2404.14469)
Quest | ICML 2024 | Query-aware sparsity: groups KV into pages and dynamically loads only query-relevant pages (the full cache is retained)
Ada-KV / NACL | 2024 | Improve eviction quality through adaptive budget allocation (Ada-KV) and bias-free random eviction (NACL)
Keyformer / BUZZ | MLSys / 2024 | Key-token selection (Keyformer); beehive-structured segmented heavy hitters (BUZZ)

Quantization (Section 3.2)
Technique | Venue | Key contribution
KIVI | ICML 2024 | Asymmetric 2-bit quantization with per-channel keys and per-token values, tuning-free (arXiv 2402.02750)
KVQuant | NeurIPS 2024 | Per-channel keys + non-uniform quantization (NUQ) below 3 bits; supports 10M-token context (arXiv 2401.18079)
Oaken | 2025 | Hybrid online-offline KV quantization for fast, efficient serving

Heads, low rank, architecture (Section 3.3)
Technique | Venue | Key contribution
GQA / MQA | 2023 (GQA) / 2019 (MQA) | Shrink the KV cache by reducing the number of KV heads (grouped / single) (arXiv 2305.13245 / 1911.02150)
MLA (DeepSeek-V2) | 2024 | KV compression into a low-dimensional latent (93% KV reduction), decoupled RoPE (arXiv 2405.04434)
Sliding Window Attn | 2020~ | Local attention in which each token sees only the last W tokens; stacking layers widens the receptive field (Longformer, Mistral)
Gemma 2 interleave | 2024 | Alternates local SWA layers with global full-attention layers to balance efficiency and long-range quality (arXiv 2408.00118)

Paging and caching (Section 4)
Technique | Venue | Key contribution
PagedAttention / vLLM | SOSP 2023 | Manages KV in blocks, like OS paging, to remove fragmentation (waste from 60 to 80% down to 4%) (arXiv 2309.06180)
vAttention | ASPLOS 2025 | Contiguous virtual addresses + dynamic physical allocation through CUDA virtual memory management (VMM); 1.97x over vLLM (arXiv 2405.04437)
RadixAttention / SGLang | 2024 | Token-level radix tree for automatic prefix sharing and reuse; 10 to 20% in multi-turn workloads (arXiv 2312.07104)
vLLM APC | 2023~ | Block-hash-based Automatic Prefix Caching + LRU and ref-count
Prompt Caching (provider) | 2024~ | Static prefix caching APIs (automatic or explicit); large cost and latency savings
GPTCache (Semantic) | 2023~ | Reuses the response itself (approximately) based on question-embedding similarity, avoiding the LLM call (the cited arXiv 2403.02694 is MeanCache, a related semantic cache)
Jenga | SOSP 2025 | Extends PagedAttention to account for heterogeneous embeddings; effective memory management

Offloading and tiering (Section 5)
Technique | Venue | Key contribution
FlexGen | ICML 2023 | Offloads weights and KV to CPU and disk on a single GPU; LP-based optimal placement (arXiv 2303.06865)
DeepSpeed-Inference | SC 2022 | Keeps only the current layer on the GPU and offloads the rest to the host (ZeRO-Inference)
PowerInfer | SOSP 2024 | Frequently activated hot neurons on the GPU, cold ones on the CPU; neuron-aware offloading (arXiv 2312.12456)
LMCache | 2024~ | Extracts KV from the engine (vLLM/SGLang) and stores and shares it across tiers (GPU/CPU/disk); up to 15x (arXiv 2510.09665)
CacheGen | SIGCOMM 2024 | Compresses and streams KV with a tensor encoder that exploits KV distribution properties; size reduced 3.5 to 4.3x (arXiv 2310.07240)
InfiniGen | OSDI 2024 | Predicts the next layer's important tokens by rehearsal and prefetches only the needed KV; 3x (arXiv 2406.19707)
Select-N / Aqua | 2025 | Selective offloading that accounts for SLOs and the network to guarantee latency (Aqua: ASPLOS '25)

Scheduling and P/D disaggregation (Section 6.1)
Technique | Venue | Key contribution
Orca | OSDI 2022 | Continuous (iteration-level) batching eases head-of-line blocking and raises throughput
Sarathi-Serve | OSDI 2024 | Chunked prefill + stall-free batching to ease colocation interference; 2.6 to 5.6x (arXiv 2403.02310)
DistServe | OSDI 2024 | Separates prefill and decode onto different GPUs; goodput optimization and joint resource optimization (arXiv 2401.09670)
Splitwise | ISCA 2024 | Splits the prompt and token phases across different machines and heterogeneous GPUs; 2.35x throughput (arXiv 2311.18677)
Mooncake | FAST 2025 | KVCache-centric disaggregation + disaggregated KV pool over CPU/DRAM/SSD; up to 525% (arXiv 2407.00079)
TetriInfer / Nexus | 2024~ | Routes requests by latency class (TetriInfer); intra-GPU disaggregation (Nexus)

PIM/PNM and CXL (Section 6.2)
Technique | Venue | Key contribution
AttAcc | ASPLOS 2024 | Heterogeneous NPU + HBM-PIM acceleration for attention (GEMV); eases the decoding bandwidth bottleneck
NeuPIMs | ASPLOS 2024 | Uses GEMM on the NPU and GEMV on PIM concurrently to accelerate batched inference (dual row buffers, among others)
IANUS / SpecPIM | ASPLOS 2024 | Unified NPU-PIM memory (IANUS); PIM acceleration of speculative decoding (SpecPIM)
LPDDR-CXL-PNM | HPCA 2024 | LPDDR-based CXL-PNM platform for TCO-efficient Transformer (GPT) inference
Pond | ASPLOS 2023 | CXL-based memory pooling system for the cloud; reduces stranded capacity
CXL-NDP | 2025 | Keeps standard CXL.mem; bit-plane layout + lossless compression to amplify effective bandwidth (KV reduced 46.9%)
Scalable CXL-PNM | 2025 | Performs token-page selection in an accelerator inside CXL at 1M tokens and 405B parameters; 21.9x throughput (arXiv 2511.00321)
Sangam | 2025 | LLM inference on an accelerator that integrates chiplet DRAM-PIM with CXL

Reliability and precision (Section 8)
Technique | Venue | Key contribution
Kelle | MICRO 2025 | eDRAM + KV caching co-design; high refresh rate for important tokens, low for unimportant ones (importance-aware)
SHIELD | 2026 | Split hierarchical memory; lifecycle-aware refresh that exploits the different error tolerance of KV (persistent) and QO (transient)
KVNAND | 2025 | DRAM-free in-flash KV; softmax error masking makes KV more error-tolerant than weights
GPU soft-error studies | 2025~ | Instruction-level fault injection; single-bit errors are masked, accumulated errors degrade output sharply, and larger models are more robust
FineServe | 2025 | Precision-aware KV slabs + two-level scheduling; allocating KV by quantization characteristics reduces fragmentation