Ryotta's Basic

MoE (Mixture of Experts): An In-Depth Analysis

Sparse Activation · Routing/Gating · Load Balancing · DeepSeekMoE · Expert Parallelism/Offloading

MoE (Mixture of Experts) is a sparse architecture: instead of running the whole model for every token, it activates only the few experts that a router selects. The key idea is to decouple total parameters (model capacity) from active parameters (per-token compute). After Shazeer et al. (2017) made conditional computation practical again for large language models, GShard, Switch Transformer, DeepSeekMoE, and DeepSeek-V3 carried the line of work forward.

MoE is not only a story about compute efficiency. It works as a system only when routing quality, load balancing, all-to-all communication, expert parallelism, and offloading are considered together. This article proceeds in the order sparse activation -> routing -> load balancing and DeepSeekMoE -> systems view -> pros, cons, and adoption -> related techniques.

1. Core Idea - Sparse Activation

A dense model computes every FFN parameter for every token, so adding parameters adds compute in step. MoE splits the FFN into several experts and lets a router pick only a few of them (top-k) per token, which decouples capacity from compute.

Figure 1. A dense FFN always computes everything; MoE activates only the selected experts.

Conditional Computation

  • Expert - Usually replaces the FFN (feed-forward) layer of a Transformer. Attention stays dense and is shared by all tokens.
  • Router (gating) - Looks at the token representation, computes a fitness score for each expert, and selects the top-k.
  • Active parameters - Per-token compute can stay small even as total parameters grow. For example, DeepSeek-V3 is described as 671B total / 37B active, and Mixtral 8x7B as 47B total / 13B active.
  • Origin - The MoE concept of Jacobs et al. (1991) was reapplied to large language models by the sparsely-gated MoE of Shazeer et al. (2017).
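
To make the gap between total and active parameters concrete, the short sketch below computes the fraction of parameters touched per token from the figures quoted above. The function name `active_fraction` and the dictionary layout are illustrative assumptions, not part of any library.

```python
# Total and active parameter counts (in billions) as quoted in the text above.
MODELS = {
    "DeepSeek-V3": (671, 37),
    "Mixtral 8x7B": (47, 13),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Fraction of the model's parameters used for a single token."""
    return active_b / total_b


for name, (total_b, active_b) in MODELS.items():
    frac = active_fraction(total_b, active_b)
    print(f"{name}: {active_b}B of {total_b}B active per token ({frac:.1%})")
# DeepSeek-V3: 37B of 671B active per token (5.5%)
# Mixtral 8x7B: 13B of 47B active per token (27.7%)
```

The ratio only describes per-token compute; it says nothing about memory, since every expert still has to be stored somewhere.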

Dense vs MoE

Item | Dense FFN | MoE
Activation per token | Entire FFN | Top-k experts only
Compute | Grows with parameter count | Roughly proportional to the number of active experts
Capacity scaling | Expensive | Capacity alone is easy to scale up
System burden | Compute-centric | Adds communication, memory, and routing

2. Routing - Gating and Top-k Selection

The router is what governs MoE behavior. It decides which experts each token is sent to and combines the outputs of the selected experts as a weighted sum.

Figure 2. Token -> router scores -> top-k experts -> weighted sum.

scores = softmax(W_g · x)
T = top_k(scores)
y = sum(g_i · E_i(x) for i in T)
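
The three steps above can be sketched for a single token in plain Python. This is a minimal illustration under stated assumptions: `moe_forward`, the toy experts, and the gate matrix are hypothetical, and the gate weight `g_i` is taken to be the raw softmax score (some models renormalize over the selected experts instead).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_weights, experts, k):
    """Route one token vector x through the top-k of len(experts) experts."""
    # scores = softmax(W_g · x): one logit per expert.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    scores = softmax(logits)
    # T = top_k(scores): indices of the k highest-scoring experts.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # y = sum(g_i · E_i(x)): only the selected experts are evaluated.
    y = [0.0] * len(x)
    for i in top:
        out = experts[i](x)
        y = [acc + scores[i] * o for acc, o in zip(y, out)]
    return y, top


# Toy setup (hypothetical): four experts that each scale the input differently.
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

y, chosen = moe_forward([2.0, 1.0], gate_weights, experts, k=2)
print(chosen)  # [0, 1]
```

Experts 2 and 3 are never called for this token, which is where the compute saving comes from.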

Routing Essentials

  • Noisy top-k (Shazeer) - Adds noise to the scores to mitigate collapse onto a few experts.
  • The k trade-off - Raising k increases both compute and quality; lowering k improves efficiency but raises the risk that a relevant expert is left out. Commonly cited settings are Switch = 1, Mixtral = 2, and DeepSeek-V3 = 8.
  • Differentiability - Routing is a discrete choice, so practical implementations multiply each selected expert's output by its gate weight, which lets gradients reach the router.
  • Capacity factor - Caps the number of tokens an expert can accept in order to handle overflow. Tokens beyond the cap are dropped or handled through another path.
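
As a rough illustration of the capacity factor, the sketch below uses the Switch-Transformer-style definition, capacity = (tokens / experts) × capacity factor, and simply drops overflow tokens. The helper names `expert_capacity` and `dispatch_top1` are assumptions made for this example; real systems may reroute overflow instead of dropping it.

```python
import math


def expert_capacity(num_tokens, num_experts, capacity_factor):
    """Maximum number of tokens a single expert may accept in one batch."""
    return math.ceil(num_tokens / num_experts * capacity_factor)


def dispatch_top1(assignments, num_experts, capacity_factor):
    """assignments[t] is the expert chosen for token t (top-1 routing)."""
    cap = expert_capacity(len(assignments), num_experts, capacity_factor)
    buckets = {e: [] for e in range(num_experts)}
    dropped = []
    for token, expert in enumerate(assignments):
        if len(buckets[expert]) < cap:
            buckets[expert].append(token)
        else:
            dropped.append(token)  # overflow: this sketch drops the token
    return buckets, dropped


# Eight tokens, five of which the router sends to expert 0.
assignments = [0, 0, 0, 0, 0, 1, 2, 3]
buckets, dropped = dispatch_top1(assignments, num_experts=4, capacity_factor=1.25)
print(buckets[0], dropped)  # [0, 1, 2] [3, 4]
```

With a capacity of 3 tokens per expert, the two surplus tokens routed to expert 0 overflow, which shows why skewed routing costs quality as well as throughput.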

Evolution of Routing Strategies

Work | Selection | Characteristics
Shazeer et al. 2017 | noisy top-k | Made MoE practical with sparse gating and load-balancing losses
GShard (2020) | top-2 | Scaled to 600B+ parameters together with automatic sharding
Switch Transformer (2021) | top-1 | Simplification that reduces communication and compute and improves training stability
DeepSeekMoE (2024) | fine-grained top-k | Splits experts into finer pieces and adds shared experts
DeepSeek-V3 (2024/2025) | top-8 of 256 | Combined with auxiliary-loss-free load balancing

3. Load Balancing and DeepSeekMoE

The chronic problem of MoE is routing collapse. If a few experts are overloaded while the rest are barely trained, model capacity is simply wasted. DeepSeekMoE and DeepSeek-V3 address this on both the architecture side and the routing side.

Figure 3. Routing collapse, load balancing, expert segmentation, and shared experts.

Load Balancing

  • Auxiliary loss - Adds a term to the loss that penalizes skewed routing. Widely used in the Switch Transformer and GShard family.
  • Auxiliary-loss-free (DeepSeek-V3) - Leaves the loss function untouched and instead adds a dynamic bias to the routing scores to even out expert utilization.
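
The two approaches can be contrasted in a small sketch. The first function follows the Switch-Transformer-style balance term, alpha × N × Σ f_i × P_i; the other two illustrate the bias-based idea in simplified form. The function names, the update step `gamma`, and the use of the mean load as the threshold are assumptions for illustration, not the exact procedure of any paper.

```python
def switch_aux_loss(router_probs, assignments, alpha=0.01):
    """Balance loss: alpha * N * sum_i f_i * P_i (equals alpha when uniform)."""
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    # f_i: fraction of tokens dispatched to expert i.
    f = [assignments.count(i) / num_tokens for i in range(num_experts)]
    # P_i: mean router probability assigned to expert i.
    p = [sum(row[i] for row in router_probs) / num_tokens
         for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))


def update_bias(bias, loads, gamma=0.001):
    """Bias-based balancing: nudge overloaded experts down, underloaded up."""
    mean_load = sum(loads) / len(loads)
    new_bias = []
    for b, load in zip(bias, loads):
        if load > mean_load:
            new_bias.append(b - gamma)
        elif load < mean_load:
            new_bias.append(b + gamma)
        else:
            new_bias.append(b)
    return new_bias


def select_with_bias(scores, bias, k):
    """The bias changes which experts are picked, not the gate weights."""
    biased = [s + b for s, b in zip(scores, bias)]
    return sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:k]


balanced = switch_aux_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1])
collapsed = switch_aux_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0])
print(balanced < collapsed)  # True

bias = update_bias([0.0, 0.0], loads=[4, 0], gamma=0.1)
print(select_with_bias([0.55, 0.45], bias, k=1))  # [1]
```

In the second half, expert 0 was overloaded, so its bias drops and the next token with a marginal preference for expert 0 is routed to expert 1 instead, without any extra gradient term.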

Two Strategies of DeepSeekMoE

  • Fine-grained expert segmentation - Splits N experts into mN smaller ones and activates more of them while keeping compute roughly the same. Expert combinations become more flexible, and less unrelated knowledge is mixed into a single expert.
  • Shared expert - Always-active shared experts handle common knowledge so that the remaining experts can focus on specialization.
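
The flexibility gain from fine-grained segmentation can be checked with a quick count of possible expert combinations. The configuration below (16 experts with top-2, each split into 4) is an illustrative choice, and the helper names are hypothetical.

```python
import math


def routing_combinations(num_experts, top_k):
    """Number of distinct expert sets a token can be routed to."""
    return math.comb(num_experts, top_k)


def fine_grained(num_experts, top_k, m):
    """Split each expert into m smaller ones and activate m times as many.

    Each small expert has 1/m of the original size, so the per-token
    expert compute (top_k * m small experts) stays about the same.
    """
    return num_experts * m, top_k * m


coarse = (16, 2)
fine = fine_grained(*coarse, m=4)

print(fine)                           # (64, 8)
print(routing_combinations(*coarse))  # 120
print(routing_combinations(*fine))    # 4426165368
```

The same compute budget goes from 120 possible expert sets to more than four billion, which is the sense in which combinations become more flexible.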

Reported Efficiency

The DeepSeekMoE paper reports that the 2B model performs comparably to GShard 2.9B, and that the 16B model achieves performance comparable to LLaMA2-7B at roughly 40% of the compute. DeepSeek-V3 pairs 671B total / 37B active parameters with the auxiliary-loss-free strategy.

4. Systems - Expert Parallelism, Memory, and Offloading

MoE reduces compute but creates new system burdens. In particular, where the experts reside determines communication and memory costs.

Figure 4. Expert parallelism (all-to-all), the memory paradox, and the expert offloading hierarchy.

System-Level Implications

  • Expert parallelism (EP) - Experts are distributed across multiple GPUs, and each token is sent to the GPU that holds its expert. Every MoE block requires all-to-all communication in a scatter/gather pattern.
  • Memory paradox - Active parameters per token are small, but because it is not known in advance which experts will be used, all experts must be loaded into memory.
  • Expert offloading - Treats VRAM as a cache for hot experts and pushes inactive experts down to CPU DRAM, CXL, or NVMe. Only active experts are brought up on demand, but PCIe/CXL bandwidth can become the bottleneck.
  • What CXL means - A larger memory pool allows experts to be tiered, so this can be viewed as the same kind of design problem as KV cache offloading.
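
The "VRAM as a hot-expert cache" idea can be sketched as a small LRU cache. This is a toy model under stated assumptions: `ExpertCache`, `fast_slots`, and `load_fn` are hypothetical names, the loader stands in for a transfer from DRAM, CXL, or NVMe, and LRU is just one possible eviction policy.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy LRU cache: fast_slots experts stay resident, the rest are loaded."""

    def __init__(self, fast_slots, load_fn):
        self.fast_slots = fast_slots
        self.load_fn = load_fn         # hypothetical loader from a slower tier
        self.resident = OrderedDict()  # expert_id -> weights, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.resident:
            self.hits += 1
            self.resident.move_to_end(expert_id)  # mark as most recently used
            return self.resident[expert_id]
        self.misses += 1                          # costs a slow-tier transfer
        if len(self.resident) >= self.fast_slots:
            self.resident.popitem(last=False)     # evict least recently used
        weights = self.load_fn(expert_id)
        self.resident[expert_id] = weights
        return weights


cache = ExpertCache(fast_slots=2, load_fn=lambda i: f"weights-{i}")
for expert_id in [0, 1, 0, 2, 0, 1]:
    cache.get(expert_id)
print(cache.hits, cache.misses)  # 2 4
```

Every miss corresponds to a transfer over PCIe or CXL, so the hit rate of the routing pattern against the fast tier largely determines whether offloading is usable in practice.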

5. Pros, Cons, and Adoption

Figure 5. Pros and cons of MoE, and representative models.

Advantages and Disadvantages

Advantages | Disadvantages
More capacity for the same compute | All experts must be kept in memory
Quality gains from expert specialization | All-to-all communication overhead
Lower training and inference FLOPs | Load balancing and training stability issues
Potentially faster pretraining | Higher deployment and operational complexity

Major MoE Models

Model | Total / active parameters | Expert configuration
Shazeer MoE | Up to 137B | Sparsely-gated FFN
Switch Transformer | Up to 1.6T | Top-1 routing
GShard | 600B+ | Top-2 + automatic sharding
Mixtral 8x7B | 47B / 13B | Top-2 of 8
DeepSeekMoE | 2B, 16B, 145B scales | Fine-grained + shared expert
DeepSeek-V3 | 671B / 37B | Top-8 of 256 + shared expert

How MoE Relates to Other Cost-Reduction Techniques

MoE focuses on sparsifying compute, quantization reduces the number of bits, and MLA or KV cache optimization reduces decoding memory. Because their goals differ, they can be used together. DeepSeek-V3, which combines MLA, MTP, and MoE, is one example.

6. Related Techniques

References - Shazeer et al. 2017, arXiv:1701.06538; GShard 2020, arXiv:2006.16668; Switch Transformer 2021, arXiv:2101.03961; DeepSeekMoE 2024, arXiv:2401.06066; DeepSeek-V3, arXiv:2412.19437.

7. Key Takeaways

MoE is an architecture that activates only some experts per token, decoupling total parameters from active parameters. This makes it possible to grow model capacity without a large increase in compute. The trade-off is that routing, load balancing, all-to-all communication, and memory residency problems come along with it.

Starting from Shazeer's sparsely-gated MoE, the design evolved through GShard's automatic sharding and Switch's top-1 simplification to DeepSeekMoE's fine-grained experts and shared experts. DeepSeek-V3 adds auxiliary-loss-free load balancing on top and can be seen as a representative example of current MoE design.

In the end, MoE is an answer not to "a bigger model" but to "smarter allocation of compute." That is why serving infrastructure and memory hierarchy design have to be considered alongside model quality.