
Architecture Venues LLM Memory Papers

์•„ํ‚คํ…์ฒ˜ ํ•™ํšŒ์˜ ๋ฉ”๋ชจ๋ฆฌ ๊ด€์  LLM ์ตœ์ ํ™” ๋…ผ๋ฌธ (2024~2026)

ISCA ยท MICRO ยท HPCA ยท ASPLOS ยท DAC ยท PACT ยท DATE โ€” KV ์บ์‹œยทPIM/PNMยทCXLยทin-storageยท์–‘์žํ™”ยท์„œ๋น™ยท์‹ ๋ขฐ์„ฑ

๋ณธ ๋ฌธ์„œ๋Š” 2024~2026๋…„ ์ฃผ์š” ์ปดํ“จํ„ฐ ์•„ํ‚คํ…์ฒ˜ ํ•™ํšŒ(ISCAยทMICROยทHPCAยทASPLOSยทDACยทPACTยทDATE)์— ๊ฒŒ์žฌ๋œ, '๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ ๊ด€์ '์—์„œ LLM์„ ์ตœ์ ํ™”ํ•œ ๋…ผ๋ฌธ๋“ค์„ ๋ชจ์•„ ์ •๋ฆฌํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋ฉ”๋ชจ๋ฆฌ ๊ด€์ ์ด๋ž€ KV ์บ์‹œยท๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญยท์šฉ๋Ÿ‰ยท๋ฐ์ดํ„ฐ ์ด๋™์„ 1๊ธ‰ ์ œ์•ฝ์œผ๋กœ ๋ณด๋Š” ์‹œ๊ฐ์œผ๋กœ, ๋ณธ ๋ฌธ์„œ๋Š” ์ด๋ฅผ 6๊ฐˆ๋ž˜ โ€” PIM/PNM ๊ฐ€์†, CXLยท๋ฉ”๋ชจ๋ฆฌ ํ’€๋ง, in-storageยทDIMM-NDP, KV ์–‘์žํ™”ยท์••์ถ•, ํฌ์†Œยท์ถ•์ถœยท์–ดํ…์…˜ IO, ์„œ๋น™ยทP/D ๋ถ„๋ฆฌยท์‹ ๋ขฐ์„ฑ โ€” ๋กœ ๋ถ„๋ฅ˜ํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ๋…ผ๋ฌธ์€ ๋ฐœํ‘œ ํ•™ํšŒยท์—ฐ๋„ยทํ•ต์‹ฌ ๊ธฐ์—ฌ๋ฅผ ํ•จ๊ป˜ ํ‘œ๊ธฐํ–ˆ์Šต๋‹ˆ๋‹ค.

๋ฒ”์œ„์™€ ์ •ํ™•์„ฑ์— ๊ด€ํ•œ ์ฃผ์˜: ํ•™ํšŒยท์—ฐ๋„๋Š” ๊ณต์‹ ํ”„๋กœ๊ทธ๋žจ๊ณผ venue-ํƒœ๊น… ์ž๋ฃŒ๋กœ ํ™•์ธํ–ˆ์œผ๋‚˜, ์ผ๋ถ€ ํ•ญ๋ชฉ์€ arXiv ๋‹จ๊ณ„๋กœ 7๊ฐœ ํ•™ํšŒ ๊ฒŒ์žฌ๊ฐ€ ํ™•์ •๋˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค(ํ‘œ์— 'arXiv'๋กœ ํ‘œ๊ธฐ). ์ด๋Ÿฐ ํ•ญ๋ชฉ์€ ํ”„๋ฆฌํ”„๋ฆฐํŠธ์ด๊ฑฐ๋‚˜ ๋‹ค๋ฅธ ํ•™ํšŒ(์˜ˆ: OSDIยทFASTยทMLSysยทEMNLP) ๋ฐœํ‘œ์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋˜ ๋น ๋ฅด๊ฒŒ ์Ÿ์•„์ง€๋Š” ๋ถ„์•ผ๋ผ ๋ณธ ๋ชฉ๋ก์€ ๋Œ€ํ‘œ ์‚ฌ๋ก€ ๋ชจ์Œ์ด๋ฉฐ ์™„์ „ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. PhD ์—ฐ๊ตฌ์— ์ธ์šฉํ•˜์‹ค ๋•Œ๋Š” ๊ฐ ๋…ผ๋ฌธ์˜ ์ถœ์ฒ˜๋ฅผ ์ง์ ‘ ํ™•์ธํ•˜์‹œ๊ธธ ๊ถŒํ•ฉ๋‹ˆ๋‹ค.

1. Venue × Year Landscape

Figure 1. Landscape of the 7 venues × 2024–2026 (each cell lists representative papers, not the full set)

Memory-centric LLM optimization work is concentrated at ISCA, MICRO, HPCA, and ASPLOS, with HPCA 2025 and ASPLOS 2026 especially rich. DAC, PACT, and DATE also carry related papers, but comparatively few. For 2026, only part of the record is public so far (mainly HPCA and ASPLOS; MICRO and ISCA 2026 are held or published later).

2. Taxonomy: Six Memory-Centric Branches

Figure 2. The six-branch taxonomy of this collection, with representative papers (venue tags)

| Category | Bottleneck targeted directly | Primary memory location | Representative papers | Key system-level question |
|---|---|---|---|---|
| PIM/PNM accelerators | GEMV bandwidth during decode | HBM-PIM, near-memory | AttAcc, NeuPIMs, Pimba | How much can be finished next to memory before data moves to the GPU? |
| CXL · memory pooling | Capacity shortage, memory sharing | CXL Type-2/3, pooled DRAM | LPDDR-CXL-PNM, SkyByte, Cohet | Which tier should hold a large KV cache, and how is coherence maintained? |
| In-storage · DIMM-NDP | Handling large KV outside the GPU | DIMM, SSD, CSD | InstAttention, Lincoln, Hermes | Can the capacity gain of a slow tier outweigh its added latency? |
| KV quantization · compression | KV storage capacity, read bandwidth | HBM, DRAM, dedicated compression path | Oaken, BitMoD, ZipServ | How few bits are possible while accuracy loss stays within tolerance? |
| Sparsity · eviction · attention IO | Unnecessary token accesses, IO movement | Entire KV cache path | ALISA, PAT, V-Rex | Can quality be preserved without reading every token? |
| Serving · P/D disaggregation · reliability | Inter-phase interference, SLOs, error sensitivity | Cluster-wide memory hierarchy | Splitwise, Bullet, Kelle | Which requests and which KV entries should get a faster tier and stronger protection? |

3. PIM / PNM Accelerators: Processing GEMV Next to Memory

The core pattern is heterogeneous acceleration: the memory-bound GEMV of decoding runs next to memory (PIM/PNM), while the GEMM of prefill goes to the NPU/GPU.

This line of work focuses on cutting the cost of the GPU repeatedly reading KV from HBM. Papers such as AttAcc, NeuPIMs, PAISE, and Pimba show that, for decode latency, "moving less data" matters more directly than "computing faster."

4. CXL · Memory Pooling: Capacity Expansion and Coherent Tiers

CXL is evolving into both a capacity-expanding memory tier and a compute substrate that hosts near-memory processing (PNM).

The essence of this axis is pushing long contexts and many concurrent requests, which HBM alone cannot hold, out to an external memory tier. Link latency, page placement, and coherence costs grow as a result, so the policy that decides which KV to demote to CXL largely determines performance.

5. In-storage · DIMM-NDP: Moving KV/Attention to Flash and DIMMs

KV is kept on high-capacity media outside the GPU (flash, DIMMs) and processed in place, which reduces the transfer bottleneck.

This approach accepts the slowest tier in exchange for the largest capacity. InstAttention, Lincoln, and Hermes finish part of the attention computation or KV access near the SSD or DIMM to cut transfer volume, but the software runtime and the device scheduler take on a correspondingly larger role.

6. KV Quantization · Compression (Hardware/System)

These works reduce the bit width and precision of the KV cache to save capacity and bandwidth, and accelerate the result with dedicated PEs and data formats.

This category is the most broadly applicable. Adjusting only the KV bit width raises capacity immediately, even with the same model and the same serving stack, so work that co-designs the hardware structure and the number format, such as Oaken, BitMoD, MANT, and ZipServ, is growing quickly.
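
The capacity effect of the KV bit width can be worked out with simple arithmetic. The sketch below uses the standard size estimate (keys plus values, per layer, per KV head, per token) and ignores quantization metadata such as scales and zero points. The model shape and the 8 GiB budget are hypothetical values chosen for illustration, not figures from any paper listed here.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, bits):
    """Approximate KV cache size in bytes; the factor 2 covers keys and values."""
    total_bits = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bits
    return total_bits // 8


def max_requests(budget_bytes, per_request_bytes):
    """How many requests of equal length fit in a fixed KV memory budget."""
    return budget_bytes // per_request_bytes


# Hypothetical model shape, one request with a 4096-token context.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096, batch=1)
fp16 = kv_cache_bytes(**cfg, bits=16)
int4 = kv_cache_bytes(**cfg, bits=4)

budget = 8 * 2**30  # assumed 8 GiB reserved for KV
print(fp16 // 2**20, "MiB at 16 bits ->", max_requests(budget, fp16), "requests")
print(int4 // 2**20, "MiB at 4 bits  ->", max_requests(budget, int4), "requests")
```

Under these assumptions, dropping from 16 to 4 bits cuts the per-request KV footprint by 4x, so the same budget holds four times as many requests.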

7. Sparsity · Eviction · Attention IO

These works shrink the KV cache by selecting or evicting tokens, or analyze and optimize the data movement (IO) of attention.

This group starts from the question "Must every KV entry be kept until the end?" ALISA, PAT, V-Rex, and the I/O Analysis line keep only the important tokens or reorder accesses, which reduces actual memory traffic as well as storage capacity.

8. Serving · P/D Disaggregation · Reliability

This section covers system-level scheduling and disaggregation, along with memory reliability (refresh, importance awareness).

Splitwise, Bullet, and QoServe put the operational problem of controlling per-request latency and throughput front and center. Works that tackle reliability head-on, such as Kelle, are still few, but as CXL and heterogeneous tiers spread, the question "Should important KV be placed in a safer tier with a stronger ECC policy?" is likely to grow into an independent research axis.

9. Synthesis: Crowded Areas and Empty Areas

Figure 4. Active areas and comparatively empty areas (research opportunities)

At a glance, the papers targeting each tier of the hardware memory stack (HBM, HBM-PIM, CXL, DIMM, flash) are as follows.

Figure 3. Representative papers targeting each tier, viewed from the hardware memory stack

Key observations

  • The most active areas are PIM/PNM decode acceleration (NPU+HBM-PIM, CXL-PNM), KV quantization hardware, P/D-disaggregated serving, and in-storage offloading.

  • The core patterns are converging: the 'GEMM=GPU/NPU, GEMV=PIM' division of labor, and the heterogeneous memory hierarchy of 'HBM (bandwidth) + CXL (capacity, pooling) + flash/DIMM (bulk capacity)'.

  • Reliability is limited to a handful of works such as Kelle (MICRO 2025). It is known that the KV cache is relatively robust to bit errors because of softmax masking, but almost no architecture paper addresses bit errors, refresh, and ECC head-on.

Key takeaway: ISCA, MICRO, HPCA, and ASPLOS in 2024–2026 carry many memory-centric LLM optimization papers (especially HPCA 2025 and ASPLOS 2026), and these fall into six branches: PIM/PNM acceleration, CXL/memory pooling, in-storage, KV quantization, sparsity/eviction, and disaggregated serving. They converge on processing the memory-bound GEMV of decoding next to memory (PIM/PNM) and on addressing capacity and bandwidth together through a heterogeneous hierarchy of HBM, CXL, and flash. Reliability (awareness of bit errors, refresh, and ECC), by contrast, remains nearly empty, with only a very few works such as Kelle.

Research opportunity (memory-system view): the most natural gap is reliability-aware KV tiering/offloading, which formalizes 'which KV, in which tier (HBM/CXL/DIMM/flash), at how many bits, with what ECC strength' under a joint accuracy SLO and performance SLO. This matters most in CXL heterogeneous memory, where bit-error characteristics differ from tier to tier, and the PIM/PNM, CXL, and quantization studies in this list provide the foundation.
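
One way the placement question above could be written down is as a constrained assignment problem. The formulation below is a sketch under stated assumptions, not one taken from any paper in this list, and every symbol is introduced here for illustration. For each KV block $i$, choose a tier $t_i \in \mathcal{T}$ (HBM, CXL, DIMM, flash), a bit width $b_i$, and an ECC strength $e_i$. The function $c$ is an assumed cost, $s_i$ the stored size of block $i$ including ECC overhead, $C_t$ the capacity of tier $t$, $\Delta_{\mathrm{acc}}$ the resulting accuracy loss, and $T_{\mathrm{lat}}$ the resulting latency.

```latex
\begin{aligned}
\min_{\{(t_i,\, b_i,\, e_i)\}} \quad & \sum_{i} c(t_i, b_i, e_i) \\
\text{s.t.} \quad
  & \Delta_{\mathrm{acc}}\bigl(\{(t_i, b_i, e_i)\}\bigr) \le \epsilon_{\mathrm{acc}}
    && \text{(accuracy SLO)} \\
  & T_{\mathrm{lat}}\bigl(\{(t_i, b_i, e_i)\}\bigr) \le T_{\mathrm{SLO}}
    && \text{(performance SLO)} \\
  & \sum_{i \,:\, t_i = t} s_i(b_i, e_i) \le C_t
    && \forall\, t \in \mathcal{T} \quad \text{(tier capacity)}
\end{aligned}
```

The accuracy term is where tier-dependent bit-error rates would enter: a block stored in a less reliable tier with weak ECC contributes more expected error, so important blocks are pushed toward safer tiers or stronger protection.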

์ฃผ์˜(์ •ํ™•์„ฑยท์™„์ „์„ฑ) โ€” ํ•™ํšŒยท์—ฐ๋„๋Š” ๊ณต์‹ ํ”„๋กœ๊ทธ๋žจ(hpca-conf.org, asplos-conference.org, IEEE/ACM DL)๊ณผ venue-ํƒœ๊น… ์ž๋ฃŒ๋กœ ๊ต์ฐจํ™•์ธํ–ˆ๋‹ค. ๋‹ค๋งŒ ํ‘œ์— 'arXiv (๋ฏธํ™•์ •)'์œผ๋กœ ํ‘œ์‹œํ•œ ํ•ญ๋ชฉ(P3-LLMยทSangamยทScalable CXL-PNMยทCXL-NDPยทL3 ๋“ฑ)์€ ๋ณธ 7๊ฐœ ํ•™ํšŒ ๊ฒŒ์žฌ๊ฐ€ ํ™•์ธ๋˜์ง€ ์•Š์•˜์œผ๋ฉฐ, ํ”„๋ฆฌํ”„๋ฆฐํŠธ์ด๊ฑฐ๋‚˜ ๋‹ค๋ฅธ ํ•™ํšŒ(OSDIยทFASTยทMLSysยทEMNLPยทSOSP ๋“ฑ) ๋ฐœํ‘œ์ผ ์ˆ˜ ์žˆ๋‹ค. ๋˜ ์ผ๋ถ€ ์‹œ์Šคํ…œ ๋…ผ๋ฌธ(InfiniGen=OSDI'24, Mooncake=FAST'25, DistServe=OSDI'24, Jenga=SOSP'25 ๋“ฑ)์€ ๋ณธ 7๊ฐœ ์•„ํ‚คํ…์ฒ˜ ํ•™ํšŒ ๋ฐ–์ด๋ผ ์ œ์™ธํ–ˆ๋‹ค. ๋ถ„์•ผ๊ฐ€ ๋งค์šฐ ๋น ๋ฅด๊ฒŒ ๋ฐœ์ „ํ•˜๋ฏ€๋กœ ๋ณธ ๋ชฉ๋ก์€ ๋Œ€ํ‘œ ์‚ฌ๋ก€ ๋ชจ์Œ์ด๋ฉฐ ์™„์ „ํ•˜์ง€ ์•Š๋‹ค โ€” ์ธ์šฉ ์ „ ์›๋ฌธยทDOI ํ™•์ธ์„ ๊ถŒํ•œ๋‹ค.

PIM / PNM accelerators (Section 3)

| Paper | Venue · Year | Group | Key contribution |
|---|---|---|---|
| AttAcc | ASPLOS 2024 | NPU+HBM-PIM | Heterogeneous acceleration that assigns attention (GEMV) to HBM-PIM and FC layers to the NPU |
| NeuPIMs | ASPLOS 2024 | NPU+HBM-PIM | Concurrent use of GEMM on the NPU and GEMV on PIM; batched inference acceleration (dual row buffers) |
| IANUS | ASPLOS 2024 | NPU-PIM unified | LLM inference acceleration with an NPU-PIM unified memory system |
| SpecPIM | ASPLOS 2024 | PIM speculative | Accelerates speculative decoding on PIM (architecture-dataflow co-exploration) |
| PAISE | HPCA 2025 | PIM scheduling | PIM-accelerated inference scheduling engine for Transformer LLMs (Samsung SDS) |
| FACIL | HPCA 2025 | SoC-PIM | SoC-PIM cooperative on-device inference via flexible DRAM address mapping (SNU) |
| LAD | HPCA 2025 | Decode acceleration | Locality-aware decoding accelerator (ICT, CAS) |
| Cambricon-LLM | MICRO 2024 | Chiplet | Chiplet-based hybrid architecture for on-device inference of 70B LLMs |
| MCBP | MICRO 2025 | Bit-slice | Reduces GEMM and KV accesses through bit-slice sparsity and repetitiveness (higher energy efficiency than A100) |
| REPA | ASPLOS 2026 | Reconfigurable PIM | Reconfigurable PIM that jointly accelerates KV cache offloading and processing (SJTU) |
| STARC | ASPLOS 2026 | PIM decoding | More efficient PIM decoding through selective token access plus remapping and clustering |
| LPU | ASPLOS 2026 | Dedicated substrate | Hardwired-neuron Language Processing Unit (ICT, CAS) |
| Pimba | ISCA 2025 | Low-precision PIM | Improves the area efficiency of KV-quantized PIM with a low-precision arithmetic PCU |
| P3-LLM | arXiv (unconfirmed) | NPU-PIM mixed precision | Low-precision edge LLM PIM inference with a hybrid number format (W4A8KV4) |
| Sangam | arXiv (unconfirmed) | Chiplet+CXL | LLM inference accelerator integrating chiplet DRAM-PIM and CXL |
CXL · memory pooling (Section 4)

| Paper | Venue · Year | Group | Key contribution |
|---|---|---|---|
| LPDDR-CXL-PNM | HPCA 2024 | CXL-PNM | TCO-efficient Transformer (GPT) inference on an LPDDR-based CXL-PNM platform |
| SkyByte | HPCA 2025 | CXL-SSD | OS-HW co-design of a memory-semantic CXL-SSD (opportunistic context switching, page promotion) |
| Cohet | HPCA 2026 | CXL coherent | CXL-based coherent heterogeneous computing framework plus full-system simulation |
| Demystifying CXL Type-2 | MICRO 2024 | CXL Type-2 | Characterization of CXL Type-2 devices (from a heterogeneous cooperative computing viewpoint) |
| Scalable CXL-PNM | arXiv (unconfirmed) | CXL-PNM | Performs token-page selection in an accelerator inside CXL at 1M tokens and 405B (21.9×) |
| CXL-NDP | arXiv (unconfirmed) | CXL near-data | Keeps standard CXL.mem; amplifies effective bandwidth with bit-plane layout plus lossless compression |
| Pond (reference) | ASPLOS 2023 | CXL pooling | CXL-based memory pooling (cloud); out of scope but foundational work |
In-storage · DIMM-NDP (Section 5)

| Paper | Venue · Year | Group | Key contribution |
|---|---|---|---|
| InstAttention/InstInfer | HPCA 2025 | In-storage | Offloads decode-phase KV attention to a Computational Storage Drive (CSD) (PKU) |
| Lincoln | HPCA 2025 | LPDDR-flash | Real-time inference of 50–100B LLMs with an LPDDR interface and compute-capable flash (THU) |
| Hermes (NDP-DIMM) | HPCA 2025 | NDP-DIMM | Augments GPU memory with NDP-DIMMs for low-cost inference (ICT, CAS) |
| AsyncDIMM | HPCA 2025 | DIMM-NMP | Achieves asynchronous execution for DIMM-based near-memory processing (SJTU) |
| UniNDP | HPCA 2025 | Near-DRAM | Unified compilation and simulation tool for near-DRAM processing architectures (THU) |
| L3 | arXiv (unconfirmed) | DIMM-PIM | Scalable long-context inference through a DIMM-PIM integrated architecture and coordination |
KV quantization · compression (Section 6)

| Paper | Venue · Year | Group | Key contribution |
|---|---|---|---|
| Oaken | ISCA 2025 | Hybrid quantization | Fast and efficient serving with online-offline hybrid KV quantization (KAIST) |
| BitMoD | HPCA 2025 | Bit-serial | Bit-serial mixture-of-datatype; per-group data-format adaptation (Cornell) |
| MANT | HPCA 2025 | Low-bit group | Per-group low-bit quantization with a mathematically adaptive numeric type, plus a real-time quantization unit (SJTU) |
| Anda | HPCA 2025 | Activation data format | Adaptive activation data format with variable-length group-shared exponents (NJU/KU Leuven) |
| VQ-LLM | HPCA 2025 | Vector quantization | High-performance code generation for vector-quantization-augmented LLM inference (SJTU) |
| ZipServ | ASPLOS 2026 | Lossless compression | Hardware-aware lossless compression that reduces memory while preserving exactness (HKUST-GZ) |
| Cocktail | DATE 2025 | Mixed precision | Long-context inference with chunk-adaptive mixed-precision quantization |
| eDKM | HPCA 2025 | Weight clustering | Train-time weight clustering that shrinks LLaMA-7B from 12.6GB to 2.5GB (Apple) |
Sparsity · eviction · attention IO (Section 7)

| Paper | Venue · Year | Group | Key contribution |
|---|---|---|---|
| ALISA | ISCA 2024 | Sparse attention | Caches only salient tokens via sparsity-aware attention, plus dynamic scheduling |
| SpeContext | ASPLOS 2026 | Context sparsity | More efficient long-context inference with speculative context sparsity (SJTU) |
| PAT | ASPLOS 2026 | Prefix attention | Decode acceleration with prefix-aware attention and multi-tile kernels |
| Mugi | ASPLOS 2026 | Value-level parallelism | Value-level parallelism: a parallel dimension finer than tensor/sequence (CMU) |
| I/O Analysis is All You Need | ASPLOS 2026 | IO analysis | IO-centric analysis of long-sequence attention (data movement dominates over FLOPs) |
| V-Rex | HPCA 2026 | Dynamic KV retrieval | Streaming video LLM acceleration with dynamic KV cache retrieval |
Serving · P/D disaggregation · reliability (Section 8)

| Paper | Venue · Year | Group | Key contribution |
|---|---|---|---|
| Splitwise | ISCA 2024 | P/D disaggregation | Splits the prompt and token phases across different machines and heterogeneous GPUs (2.35× throughput) |
| MuxWise | ASPLOS 2026 | Intra-GPU multiplexing | Serving with prefill-decode multiplexing inside a GPU (SLO-aware dispatcher) |
| Bullet | ASPLOS 2026 | Concurrent P/D execution | Concurrent prefill and decode execution plus resource allocation driven by a real-time performance model (SYSU) |
| TPLA | ASPLOS 2026 | Disaggregation + latent attention | Disaggregated P/D long-context inference with tensor-parallel latent attention (PKU) |
| QoServe | ASPLOS 2026 | QoS scheduling | SLO guarantees through fine-grained QoS classes and dynamic chunking (MSR India) |
| BlendServe | ASPLOS 2026 | Offline batching | Higher offline inference throughput through resource-aware batching (UC Berkeley) |
| MoE-APEX | ASPLOS 2026 | Expert offloading | Relieves MoE memory pressure with adaptive-precision expert offloading (SJTU) |
| DynamoLLM | HPCA 2025 | Energy | Selects the energy-optimal configuration (parallelism, GPU frequency) according to load and resources (UIUC/MS) |
| throttLL'eM | HPCA 2025 | Energy | Energy-efficient inference serving with predictive GPU throttling (NTU Athens) |
| Kelle | MICRO 2025 | Reliability · refresh | eDRAM+KV co-design: high refresh rate for important tokens, low for unimportant ones |
| RoMe | HPCA 2026 | Memory access | Row Granularity Access Memory System for LLM |

10. How It Works

The common operating principle in this field is fairly simple. Prefill is dominated by large matrix multiplications (GEMM), so it suits compute resources such as GPUs and NPUs. Decode re-reads the KV cache for every generated token and behaves like GEMV, so memory bandwidth and capacity become the bottleneck. Rather than handling prefill and decode in the same place, the papers therefore converge on placing phases with different bottlenecks on different hardware.

๊ตฌ๊ฐ„ ์ฃผ๋œ ๋ณ‘๋ชฉ ๋Œ€ํ‘œ ๋Œ€์‘
Prefill compute-bound GEMM GPU/NPU ์ง‘์ค‘, ๋ฐฐ์น˜ ์ตœ์ ํ™”
Decode memory-bound GEMV HBM-PIM, CXL-PNM, DIMM-NDP
Tiering ์šฉ๋Ÿ‰ยท๋Œ€์—ญํญ ๋ถ€์กฑ CXL, DRAM, SSD, ํ”Œ๋ž˜์‹œ ๋ถ„์‚ฐ ๋ฐฐ์น˜
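
The compute-bound versus memory-bound split can be quantified with arithmetic intensity (FLOPs per byte moved). The sketch below compares one weight-matrix multiplication in prefill (many tokens at once, GEMM) with the same multiplication in decode (one token, GEMV). The matrix sizes and the 2-byte element width are hypothetical, and the byte count assumes each operand is read or written exactly once.

```python
def arithmetic_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) product.

    Counts one multiply and one add per inner-product term, and assumes
    each of the two inputs and the output crosses the memory bus once.
    """
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


# m = number of tokens processed together; the (k x n) matrix is the weight.
prefill = arithmetic_intensity(m=2048, k=4096, n=4096)  # GEMM
decode = arithmetic_intensity(m=1, k=4096, n=4096)      # GEMV

print(f"prefill: {prefill:.1f} FLOPs/byte")
print(f"decode:  {decode:.3f} FLOPs/byte")
```

With these sizes prefill performs roughly a thousand FLOPs per byte while decode performs about one, which is why decode is limited by how fast bytes arrive and benefits from compute placed next to memory.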

Works that shrink the KV cache let the same flow hold more requests, and P/D-disaggregation works split the compute path to reduce interference. PIM/PNM works, in contrast, try to finish the computation next to memory before data moves to the GPU, and in-storage works handle KV at an even lower tier, reducing the transfer itself.

11. Pros and Cons

| Perspective | Strengths | Limitations |
|---|---|---|
| PIM/PNM | Targets the decode bottleneck directly and reduces data movement | Strong hardware dependence and high design complexity |
| CXL/tiering | Easy capacity expansion; can connect to pooled memory | Link latency and coherence management are a burden |
| In-storage/DIMM-NDP | Can exploit high-capacity tiers outside the GPU | Low bandwidth; requires kernel/software modifications |
| KV quantization/compression | Immediate, large memory savings with wide applicability | Accuracy loss and data-format design are critical |
| P/D disaggregation · serving | Raises throughput by optimizing prefill and decode separately | Routing, scheduling, and state management become complex |
| Reliability axis | Can reflect error sensitivity across the whole memory hierarchy | Still few studies, and evaluation criteria are fragmented |

Overall, the performance gains in this field are clear, but system integration and verification costs are high. From a productization standpoint in particular, accuracy, latency, cost, and reliability have to be balanced together.

12. Related Technologies

| Resource | Connection |
|---|---|
| Memory Centric LLM Serving Survey | Overview of the KV management, offloading, P/D disaggregation, and PIM/PNM axes |
| PagedAttention Analysis | Page/block-based KV management |
| KV Cache Quantization Analysis | Trade-off between reduced KV bit width and accuracy |
| Disaggregated LLM Serving Analysis | P/D disaggregation and cluster placement |
| CXL Version Comparison | Memory expansion and coherency basis of CXL Type-1/2/3 |
| NAND NVMe SSD Analysis | Background on the underlying storage tier for in-storage work |
| Memory Controller | Bandwidth, channel, and bank placement viewpoint |

Beyond the scope of this document, the directly connected material consists of surveys on KV cache management, on efficient LLM inference, and on the hardware perspective. Because the memory hierarchy and scheduling are intertwined, reading alongside these overview documents builds understanding faster than reading individual papers alone.

13. Key Takeaways

LLM memory research at the 2024–2026 architecture conferences is converging on PIM/PNM, CXL tiering, in-storage processing, KV quantization, sparsity/eviction, and P/D disaggregation. The common direction is to reduce the memory-bound bottleneck of decode and to divide HBM, CXL, DRAM, and SSD by role.

The reliability axis is still largely empty, however, so reliability-aware KV tiering and offloading remains the most natural research gap. The field offers large improvements in performance and cost, but these come with high hardware-software integration difficulty.