Ryotta's Basic

MoE Serving Routing LoadBalance

MoE Serving - An In-Depth Analysis of Routing and Load Balancing

Expert Routing · All-to-All Dispatch · Capacity Factor · Expert Parallelism · Hot Expert Replication

Overview

The bottleneck in MoE serving has less to do with the number of experts than with how evenly and how quickly tokens reach their selected experts. Sparse activation cuts computation, but as soon as expert parallelism is used, all-to-all dispatch and gather steps are added, and communication becomes the dominant source of latency.

Routing quality at training time and throughput at serving time are not the same problem. Even a slight routing skew lets one expert shard set the tail latency, and the capacity factor and batching policy govern the stability of the whole system.

Figure 1. MoE serving pipeline: router, dispatch, expert compute, and gather form a single flow.

1. Core Concepts

| Concept | Meaning | Serving impact |
|---|---|---|
| Routing / Gating | The step that selects the top-k experts for each token | The routing distribution directly becomes the load distribution |
| Expert Parallelism | A placement scheme that shards experts across multiple GPUs | Requires all-to-all communication between GPUs |
| Capacity Factor | A multiplier that sets the upper limit on tokens an expert accepts at once | Overflow tokens are dropped or made to wait |
| Hot Expert | An expert that is selected very frequently | Creates hotspots, queueing, and tail latency |
| Hot Expert Replication | Placing replicas of frequently used experts | Spends memory to relieve the bottleneck |
| Dropless Buffering | Absorbing overflow instead of dropping it immediately | Preserves quality, but latency can increase |
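
To make the capacity factor row concrete, here is a minimal sketch in plain Python. It assumes the common convention that an expert's capacity is the perfectly balanced share of tokens multiplied by the capacity factor, and that overflow is decided first-come, first-served. The function names `expert_capacity` and `split_overflow` are illustrative, not taken from any particular serving framework.

```python
import math
from collections import Counter

def expert_capacity(num_tokens, num_experts, top_k, capacity_factor):
    """Per-expert token limit: capacity_factor times the perfectly balanced share."""
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)

def split_overflow(assignments, capacity):
    """Return (kept, overflow) token indices, filling each expert first-come, first-served."""
    used = Counter()
    kept, overflow = [], []
    for token_idx, expert in enumerate(assignments):
        if used[expert] < capacity:
            used[expert] += 1
            kept.append(token_idx)
        else:
            overflow.append(token_idx)  # a real system would drop or buffer these
    return kept, overflow

# 8 tokens, 4 experts, top-1 routing; expert 0 is "hot" (assumed toy data).
assignments = [0, 0, 0, 0, 0, 1, 2, 3]
capacity = expert_capacity(num_tokens=8, num_experts=4, top_k=1, capacity_factor=1.25)
kept, overflow = split_overflow(assignments, capacity)
```

With these toy numbers the capacity is 3, so two of the five tokens routed to expert 0 overflow. Whether they are dropped or buffered is the policy choice the table describes.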

Figure 2. Under expert parallelism, tokens travel between GPUs, and if they pile up on one side the whole step slows down.

2. Comparison and Analysis

| Approach | Core idea | Strength | Weakness |
|---|---|---|---|
| Switch Transformer | Top-1 routing with simple balance control | Low communication cost | Narrow range of routing choices |
| GShard family | Top-k routing with capacity management | Good quality and scalability | Communication and implementation are complex |
| DeepSeekMoE | Fine-grained experts plus shared experts | Experts specialize well | Expert management overhead grows |
| DeepSeek-V3 | Auxiliary-loss-free load balancing | Good training stability | Serving-time placement is a separate problem |
| Hot expert replication | Replicating popular experts | Reduces tail latency | Memory usage increases |
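
The trade-off in the hot expert replication row can be sketched as a small planning problem. The following is a hypothetical greedy planner, not a published algorithm: it assumes per-expert token counts are known from recent traffic and that tokens split evenly across replicas. `plan_replicas` and `worst_load` are illustrative names.

```python
import heapq

def plan_replicas(token_counts, extra_replicas):
    """Greedily give each spare replica to the expert with the highest per-replica load."""
    replicas = {e: 1 for e in token_counts}
    # heapq is a min-heap, so store negated per-replica load to pop the hottest expert.
    heap = [(-count, e) for e, count in token_counts.items()]
    heapq.heapify(heap)
    for _ in range(extra_replicas):
        _, e = heapq.heappop(heap)
        replicas[e] += 1
        heapq.heappush(heap, (-token_counts[e] / replicas[e], e))
    return replicas

def worst_load(token_counts, replicas):
    """Highest per-replica token count, assuming an even split across replicas."""
    return max(token_counts[e] / replicas[e] for e in token_counts)

# Assumed toy traffic: expert e0 receives most of the tokens.
token_counts = {"e0": 900, "e1": 300, "e2": 200, "e3": 100}
replicas = plan_replicas(token_counts, extra_replicas=2)
before = worst_load(token_counts, {e: 1 for e in token_counts})
after = worst_load(token_counts, replicas)
```

Here both spare replicas go to `e0`, and the worst per-replica load falls from 900 to 300 tokens at the cost of two extra copies of that expert's weights.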

Figure 3. Training, routing, placement, and kernel optimization each reduce load at a different layer.

3. How It Works

For each token, the router computes scores and selects the top-k experts; the dispatcher then groups tokens bound for the same expert and sends them to the GPU that hosts it. Once expert computation finishes, the gather step merges the outputs back into the original token order.
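
The route, dispatch, compute, and gather steps can be sketched on a single process in plain Python. This is a toy model under stated assumptions: experts are plain weight matrices, gates are a softmax over the selected experts only (one common convention), and there is no real GPU communication. All names here (`matvec`, `route`, `moe_forward`) are illustrative.

```python
import math
from collections import defaultdict

def matvec(vec, mat):
    """Row vector times matrix, with mat given as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def route(x, router_w, top_k):
    """Return [(expert_id, gate), ...] per token; gates are a softmax over the chosen experts."""
    routes = []
    for token in x:
        logits = matvec(token, router_w)
        chosen = sorted(range(len(logits)), key=lambda e: -logits[e])[:top_k]
        peak = max(logits[e] for e in chosen)
        weights = [math.exp(logits[e] - peak) for e in chosen]
        total = sum(weights)
        routes.append([(e, w / total) for e, w in zip(chosen, weights)])
    return routes

def moe_forward(x, router_w, experts, top_k=2):
    routes = route(x, router_w, top_k)
    # Dispatch: group token indices by destination expert.
    buckets = defaultdict(list)
    for t, picks in enumerate(routes):
        for e, gate in picks:
            buckets[e].append((t, gate))
    # Expert compute, then gather: add gate-weighted outputs back at the original index.
    out = [[0.0] * len(experts[0][0]) for _ in x]
    for e, items in buckets.items():
        for t, gate in items:
            y = matvec(x[t], experts[e])
            out[t] = [o + gate * v for o, v in zip(out[t], y)]
    return out, buckets

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
router_w = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0]]   # 2 features -> 3 experts
experts = [[[1.0, 0.0], [0.0, 1.0]],           # expert 0: identity
           [[2.0, 0.0], [0.0, 2.0]],           # expert 1: scale by 2
           [[0.0, 0.0], [0.0, 0.0]]]           # expert 2: zero
out, buckets = moe_forward(x, router_w, experts)
```

In a real expert-parallel deployment, the `buckets` grouping is what the all-to-all dispatch ships between GPUs, and the final accumulation corresponds to the gather step.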

The most sensitive point in this flow is routing skew. When tokens pile onto one expert, the GPU hosting that expert absorbs both the extra compute and the extra communication while the remaining GPUs sit idle. If the capacity factor is too small, overflow occurs; if it is too large, a hot expert accepts more tokens and its queue grows longer.
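
A quick way to see the effect of skew is to count tokens per GPU and compare the busiest GPU with the average. The sketch below assumes experts are placed on GPUs in contiguous blocks and uses max-over-mean as the imbalance metric. Both choices are illustrative assumptions, as are the function names.

```python
from collections import Counter

def gpu_loads(expert_ids, experts_per_gpu, num_gpus):
    """Count tokens landing on each GPU when experts are placed in contiguous blocks."""
    loads = [0] * num_gpus
    for expert, count in Counter(expert_ids).items():
        loads[expert // experts_per_gpu] += count
    return loads

def imbalance(loads):
    """Max load over mean load; 1.0 means perfectly balanced."""
    return max(loads) / (sum(loads) / len(loads))

# 8 experts on 4 GPUs (2 per GPU), 16 routed tokens in each scenario.
balanced = [0, 1, 2, 3, 4, 5, 6, 7] * 2
skewed = [0] * 9 + [1, 2, 3, 4, 5, 6, 7]

balanced_ratio = imbalance(gpu_loads(balanced, experts_per_gpu=2, num_gpus=4))
skewed_ratio = imbalance(gpu_loads(skewed, experts_per_gpu=2, num_gpus=4))
```

In the skewed case the GPU hosting expert 0 handles 10 of the 16 tokens, which is 2.5 times the average, so the step waits on that GPU while the other three finish early.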

Figure 4. When per-expert token counts are uneven, capacity overflow and queueing appear at the same time.

4. Pros and Cons

| Item | Strength | Weakness |
|---|---|---|
| Sparse activation | Reduces the active compute per token | Introduces a new communication bottleneck |
| Expert specialization | Experts learn distinct roles | Traffic can concentrate on particular experts |
| Topology-aware placement | Reduces network bottlenecks | Placement and operations become complex |
| Fused all-to-all | Lowers fixed overhead | Kernel implementation is difficult |

MoE serving looks attractive when you consider only the model, but in production, network topology and expert placement determine performance. More important than a good router is an operating approach that never creates a slow shard.
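
One simple way to avoid creating a slow shard is to place experts by observed load rather than by index. The sketch below uses the longest-processing-time heuristic (heaviest expert goes to the currently lightest GPU). It is an illustrative assumption, not a method described in this post, and it ignores network topology entirely. `place_experts` and the load numbers are hypothetical.

```python
import heapq

def place_experts(expert_load, num_gpus):
    """Longest-processing-time heuristic: heaviest expert goes to the lightest GPU."""
    heap = [(0, gpu) for gpu in range(num_gpus)]   # (current load, gpu id)
    heapq.heapify(heap)
    placement = {gpu: [] for gpu in range(num_gpus)}
    for expert, load in sorted(expert_load.items(), key=lambda kv: -kv[1]):
        gpu_load, gpu = heapq.heappop(heap)
        placement[gpu].append(expert)
        heapq.heappush(heap, (gpu_load + load, gpu))
    return placement

# Assumed per-expert token counts from recent traffic.
expert_load = {0: 50, 1: 40, 2: 30, 3: 20, 4: 10, 5: 10}
placement = place_experts(expert_load, num_gpus=2)
per_gpu = {g: sum(expert_load[e] for e in es) for g, es in placement.items()}
```

With these numbers, contiguous placement (experts 0 to 2 on one GPU, 3 to 5 on the other) would give loads of 120 and 40, while the greedy placement gives 80 and 80. A topology-aware version would also weigh which GPUs share a fast interconnect.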

Figure 5. The scheduler, router, dispatcher, expert compute, and gather must be tied into a single loop for latency to stabilize.

5. Related Techniques

Switch Transformer simplified the routing algorithm and reduced communication cost and training instability, while DeepSeekMoE strengthened specialization through fine-grained experts and shared experts. DeepSeek-V3 combines a 671B total / 37B activated parameter structure with auxiliary-loss-free load balancing to balance stability and efficiency.
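
The idea behind auxiliary-loss-free load balancing can be illustrated with a toy example: a per-expert bias is added to the routing score for expert selection only, then nudged down for overloaded experts and up for underloaded ones after each step. The sketch below is a simplified illustration of that idea, not DeepSeek-V3's implementation. The scores, the step size, and the function names are assumptions.

```python
def select_top_k(scores, bias, top_k):
    """Pick experts by score + bias; the bias affects selection only, not the gate value."""
    order = sorted(range(len(scores)), key=lambda e: -(scores[e] + bias[e]))
    return order[:top_k]

def update_bias(bias, load, step_size):
    """Nudge overloaded experts down and underloaded experts up by a fixed step."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - step_size)
        elif n < mean_load:
            new_bias.append(b + step_size)
        else:
            new_bias.append(b)
    return new_bias

# Assumed toy batch: every token prefers expert 0, some only by a small margin.
batch = [
    [0.90, 0.10, 0.00], [0.80, 0.20, 0.00],
    [0.60, 0.50, 0.10], [0.60, 0.10, 0.50],
    [0.55, 0.50, 0.20], [0.55, 0.20, 0.50],
]
bias = [0.0, 0.0, 0.0]
history = []
for _ in range(4):
    load = [0, 0, 0]
    for scores in batch:
        for e in select_top_k(scores, bias, top_k=1):
            load[e] += 1
    history.append(load)
    bias = update_bias(bias, load, step_size=0.04)
```

In this toy run the per-expert load moves from 6/0/0 to 4/1/1 and then settles at 2/2/2, without adding any balancing term to the training loss. As the comparison table notes, this addresses training-time balance; serving-time placement remains a separate problem.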

6. Key Takeaways

The key to MoE serving is not sparse activation itself but how evenly tokens flow to their selected experts. Routing skew is a compute problem and a communication problem at the same time.

Operating MoE therefore does not end with model design: capacity factor, batch policy, expert placement, hot expert replication, and fused all-to-all all have to be considered together.

Serving quality is determined not by the fastest expert but by the slowest shard. In that sense, MoE serving is best treated as a distributed systems problem.