Big Bird Transformer for Longer Sequences

The self-attention mechanism in Transformers allows every token to attend independently to every other token. However, full self-attention requires computation that is quadratic in the sequence length. The Big Bird Transformer was proposed in the paper “Big Bird: Transformers for Longer Sequences”, published by Google in July 2020. Big Bird is a sparse attention mechanism that reduces this computation from quadratic to linear, and can handle sequences up to 8x longer than was previously possible on similar hardware.
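
To put rough numbers on the savings, here is a quick back-of-the-envelope sketch. The sequence length and per-query budgets below are illustrative choices, not the paper's defaults:

```python
n = 4096                   # sequence length (illustrative)
w, g, r = 192, 128, 192    # window, global, and random keys per query (illustrative)

full_attention_scores = n * n              # every query scores every key
sparse_attention_scores = n * (w + g + r)  # each query scores only w + g + r keys

print(f"full:   {full_attention_scores:,}")    # 16,777,216
print(f"sparse: {sparse_attention_scores:,}")  # 2,097,152
```

Because the per-query budget $w + g + r$ stays fixed as $n$ grows, the sparse pattern scales linearly with sequence length.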

Big Bird performs self-attention using three main strategies, sketched in code after this list:

  • A set of $g$ global tokens attending to all parts of the sequence
  • All tokens attending to a set of $w$ local neighboring tokens. That is, the query at location $i$ attends to keys at locations $i - \frac{w}{2}$ through $i + \frac{w}{2}$.
  • All tokens attending to a set of $r$ random tokens
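
The following is a minimal sketch of these three patterns as a boolean attention mask, using NumPy. The function name `bigbird_attention_mask` and the default values of $g$, $w$, and $r$ are illustrative; the actual Big Bird implementation operates on blocks of tokens for efficiency, which is not reproduced here.

```python
import numpy as np

def bigbird_attention_mask(n, g=2, w=4, r=2, seed=0):
    """Return a boolean (n, n) mask where mask[i, j] is True if query i may attend to key j."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)

    # Global tokens: the first g tokens attend to all positions and are attended to by all.
    mask[:g, :] = True
    mask[:, :g] = True

    # Sliding window: query i attends to keys i - w//2 .. i + w//2 (clipped to the sequence).
    for i in range(n):
        lo, hi = max(0, i - w // 2), min(n, i + w // 2 + 1)
        mask[i, lo:hi] = True

    # Random keys: each query additionally attends to r randomly chosen keys.
    for i in range(n):
        mask[i, rng.choice(n, size=r, replace=False)] = True

    return mask

mask = bigbird_attention_mask(n=16)
print(mask.sum(), "of", mask.size, "query/key pairs are kept")
```

Each row of the mask has only $O(g + w + r)$ entries set to True (plus the columns of the global tokens), which is what makes the overall attention cost linear rather than quadratic in $n$.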
Written on January 27, 2023