SWITCH - Sparsely Activated Encoder-Decoder Language Model
The SWITCH model was described in the paper “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity”, published in January 2021. It is a sparsely activated expert model, i.e. it activates only a subset of the neural network weights for each incoming example. The authors claim this simplifies and improves over the Mixture of Experts (MoE) architecture.
Comparison with T5 Base
- The SWITCH Transformer model uses a sparse T5 encoder-decoder architecture, where the original dense FFN is replaced with a sparse Switch FFN layer.
- The authors show that SWITCH transformers are more sample efficient, reaching the same level of perplexity 2.5x faster than T5-Base, which uses the same amount of computation.
- However, although SWITCH-Base is more sample efficient than T5-Base and also performs better in fine-tuning experiments, it has 17x more parameters than T5-Base. When SWITCH-Base is distilled down to the same number of parameters as T5-Base, it still beats T5-Base, but it loses most of its performance advantage (a sketch of a typical distillation objective follows this list).
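For reference, distilling the large sparse teacher into a small dense student typically means training the student on a blend of the ground-truth labels and the teacher's soft predictions. The sketch below shows one common form of such an objective; the `alpha` weighting, the temperature `T`, and the tensor shapes are illustrative assumptions, not the exact recipe from the paper.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.75, T=1.0):
    """Blend hard-label cross-entropy with a soft-target loss from the teacher.

    student_logits, teacher_logits: [num_tokens, vocab_size]; labels: [num_tokens].
    alpha and T are illustrative defaults, not the paper's settings.
    """
    # standard cross-entropy against the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    # KL divergence between the student's and the (frozen) teacher's distributions
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1.0 - alpha) * soft
```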
Sparse Activation
We now compare and contrast MoE routing with the Switch routing proposed in this paper.
Mixture of Experts (MoE) routing
- Given an input token representation $x$, MoE routes it to the top-k experts $\tau$, selected out of N experts (a code sketch of this routing follows the list).
- The router variable $W_r$ produces logits $h(x) = W_r \cdot x$, which are normalized via a softmax over the N available experts at that layer: \(p_i(x) = \frac{e^{h(x)_i}}{\sum_{j}^N e^{h(x)_j}}\)
- The output of the layer is the linear combination of each expert’s computation $E_i(x)$ on the token, multiplied by the gate value $p_i(x)$: \(y = \sum_{i \in \tau} p_i(x) E_i(x)\)
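As a concrete illustration of the routing equations above, here is a minimal PyTorch sketch of a top-k MoE layer. The layer sizes, the two-layer ReLU experts, and the plain per-token loop are simplifying assumptions for readability, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)   # router weights W_r
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                              # x: [num_tokens, d_model]
        logits = self.router(x)                        # h(x) = W_r . x
        probs = F.softmax(logits, dim=-1)              # p_i(x) over the N experts
        top_p, top_i = probs.topk(self.k, dim=-1)      # top-k gate values and indices
        outputs = []
        for t in range(x.size(0)):                     # token loop: clarity over speed
            y_t = sum(p * self.experts[int(i)](x[t])   # y = sum_{i in tau} p_i(x) E_i(x)
                      for p, i in zip(top_p[t], top_i[t]))
            outputs.append(y_t)
        return torch.stack(outputs)

# usage: route 4 token vectors through the layer
moe = MoELayer()
print(moe(torch.randn(4, 512)).shape)                  # torch.Size([4, 512])
```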
Switch routing
- Switch routing sends each token to only a single expert; the resulting sparse layer is referred to as a Switch layer.
- The SWITCH Transformer encoder block is illustrated below. The original dense FFN is replaced with a sparse Switch FFN layer (light blue). The Switch FFN layer returns the output of the selected FFN expert, multiplied by the router gate value (dotted line). A code sketch of such a layer follows.
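Below is a hedged sketch of such a Switch FFN layer, in the same style as the MoE sketch above. Only the k = 1 routing is meant to mirror the paper; the expert structure and dimensions are placeholders, and the paper's load-balancing auxiliary loss and expert capacity handling are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                              # x: [num_tokens, d_model]
        probs = F.softmax(self.router(x), dim=-1)      # router probabilities p_i(x)
        gate, idx = probs.max(dim=-1)                  # single highest-probability expert
        outputs = []
        for t in range(x.size(0)):
            # each token is processed by exactly one expert FFN, and the output is
            # scaled by the router gate value (the dotted line in the paper's figure)
            outputs.append(gate[t] * self.experts[int(idx[t])](x[t]))
        return torch.stack(outputs)
```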
Evaluation
- SWITCH Transformers are more sample efficient when compared to the T5 transformer, achieving a lower loss with fewer training steps. SWITCH-Base is designed to use the same amount of computation as T5-Base.
- The authors designed FLOP-matched SWITCH transformers: SWITCH-Base to match T5-Base, and SWITCH-Large to match T5-Large. On fine-tuning results, both SWITCH transformers outperform their respective T5 models.
- But SWITCH-Base has a total of 3800M parameters vs T5-Base's 223M parameters. The authors also produced a distilled version of SWITCH-Base with 223M parameters (the same as T5-Base), which still slightly outperforms T5-Base on SuperGLUE (see the back-of-the-envelope calculation below).
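A back-of-the-envelope calculation makes the parameters-vs-FLOPs distinction concrete: a Switch layer stores N expert FFNs, but each token is processed by only one of them, so per-token compute stays close to that of the dense layer. The dimensions below are illustrative, not the actual T5-Base / SWITCH-Base configuration.

```python
# Illustrative sizes only; not the real T5-Base / SWITCH-Base hyperparameters.
d_model, d_ff, n_experts = 768, 3072, 64

dense_ffn_params  = 2 * d_model * d_ff                # one FFN, ignoring biases
switch_ffn_params = n_experts * dense_ffn_params      # N expert FFNs are stored
router_params     = d_model * n_experts

# Per token, both layers apply exactly one FFN, so the multiply-accumulate count
# is almost identical; the router adds only a negligible amount of compute.
dense_flops_per_token  = 2 * dense_ffn_params
switch_flops_per_token = 2 * dense_ffn_params + 2 * router_params

print(f"params       dense: {dense_ffn_params:>12,}  switch: {switch_ffn_params + router_params:>12,}")
print(f"FLOPs/token  dense: {dense_flops_per_token:>12,}  switch: {switch_flops_per_token:>12,}")
```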
Written on February 20, 2023