LLaMA-1
The LLaMA language model was introduced in the paper “LLaMA: Open and Efficient Foundation Language Models”, published by Meta AI in February 2023.
Approach
The authors trained four different model sizes, as shown in the table below (a quick optimizer-step calculation follows the table):
params | dimension | $n$ heads | $n$ layers | learning rate | batch size (tokens) | $n$ tokens |
---|---|---|---|---|---|---|
6.7B | 4096 | 32 | 32 | $3.0 \times 10^{-4}$ | 4M | 1.0T |
13.0B | 5120 | 40 | 40 | $3.0 \times 10^{-4}$ | 4M | 1.0T |
32.5B | 6656 | 52 | 60 | $1.5 \times 10^{-4}$ | 4M | 1.4T |
65.2B | 8192 | 64 | 80 | $1.5 \times 10^{-4}$ | 4M | 1.4T |
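Since the batch size is measured in tokens, the number of optimizer steps follows directly from the token budget. A minimal sketch of that arithmetic (the step counts are derived here, not reported in the paper):

```python
# Rough optimizer-step count: total training tokens divided by tokens per batch.
# Derived from the table above; the paper itself does not report step counts.
configs = {
    "7B":  {"tokens": 1.0e12, "batch_tokens": 4e6},
    "13B": {"tokens": 1.0e12, "batch_tokens": 4e6},
    "33B": {"tokens": 1.4e12, "batch_tokens": 4e6},
    "65B": {"tokens": 1.4e12, "batch_tokens": 4e6},
}

for name, cfg in configs.items():
    steps = cfg["tokens"] / cfg["batch_tokens"]
    print(f"LLaMA-{name}: ~{steps:,.0f} optimizer steps")
# LLaMA-7B: ~250,000 optimizer steps
# LLaMA-13B: ~250,000 optimizer steps
# LLaMA-33B: ~350,000 optimizer steps
# LLaMA-65B: ~350,000 optimizer steps
```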
Pre-training Dataset
The full training dataset contains roughly 1.4T tokens after subword tokenization. Except for the Wikipedia and Books data, which are seen approximately twice during training, each dataset is seen roughly once. The data mixture used for pre-training is shown below (an illustrative sampling sketch follows the per-dataset notes).
Dataset | Sampling proportion | Epochs | Disk size |
---|---|---|---|
CommonCrawl | 67.0% | 1.10 | 3.3 TB |
C4 | 15.0% | 1.06 | 783 GB |
Github | 4.5% | 0.64 | 328 GB |
Wikipedia | 4.5% | 2.45 | 83 GB |
Books | 4.5% | 2.23 | 85 GB |
ArXiv | 2.5% | 1.06 | 92 GB |
StackExchange | 2.0% | 1.03 | 78 GB |
- English CommonCrawl: CommonCrawl dumps from 2017 to 2020, preprocessed with the CCNet pipeline, which deduplicates the data at the line level, performs language identification with a fastText classifier to remove non-English pages, and filters low-quality content with an n-gram language model.
- C4: this was the dataset used to train T5.
- Github: the public GitHub dataset available on Google BigQuery.
- Wikipedia: dumps from the June-August 2022 period, covering 20 languages. Hyperlinks, comments, and other formatting boilerplate are removed.
- Gutenberg and Books3: the Gutenberg Project which contains books in the public domain, and the Books3 section of ThePile, a publicly available dataset for training large language models.
- ArXiv: arXiv LaTeX files, processed to add scientific data to the dataset.
- Stack Exchange: kept the data from the 28 largest websites.
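To make the “sampling proportion” column concrete, here is a minimal sketch of drawing training examples from a weighted mixture of sources. This is illustrative only and not the authors' actual data pipeline; the `draw_source` helper and sampling per example are assumptions.

```python
import random

# Sampling proportions from the table above (sum to ~1.0).
mixture = {
    "CommonCrawl": 0.670,
    "C4": 0.150,
    "Github": 0.045,
    "Wikipedia": 0.045,
    "Books": 0.045,
    "ArXiv": 0.025,
    "StackExchange": 0.020,
}

def draw_source(rng: random.Random) -> str:
    """Pick which dataset the next training example comes from (hypothetical helper)."""
    names, weights = zip(*mixture.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in mixture}
for _ in range(100_000):
    counts[draw_source(rng)] += 1
print(counts)  # empirical counts track the target proportions closely
```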
Architecture
The LLaMA authors leveraged various transformer improvements proposed by prior work (a minimal sketch of the main components follows this list):
- Pre-normalization: inspired by GPT-3. The input of each transformer sub-layer is normalized with RMSNorm, instead of normalizing the output.
- SwiGLU activation function: inspired by PaLM. Instead of the usual ReLU activation, the feed-forward layers use the SwiGLU activation function.
- Rotary embeddings: inspired by GPTNeo. Instead of absolute positional embeddings, LLaMA applies rotary positional embeddings (RoPE) at each layer of the network.
- Gradient checkpointing: the amount of activations recomputed during the backward pass is reduced by saving the activations that are expensive to compute (such as the outputs of linear layers), trading some extra memory for less recomputation.
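Below is a minimal PyTorch-style sketch of the first three components, for orientation only. It is not Meta's reference implementation: the module names and the "rotate-half" formulation of RoPE (the official code uses an equivalent complex-number form) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square norm: no mean-centering and no bias, unlike LayerNorm."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Scale each vector by the inverse RMS of its last dimension.
        inv_rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * (x * inv_rms)

class SwiGLUFeedForward(nn.Module):
    """SwiGLU feed-forward: W2(SiLU(W1 x) * W3 x), replacing the ReLU MLP."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

def apply_rotary(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embedding ("rotate-half" variant) for x of shape
    (batch, seq_len, n_heads, head_dim)."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)     # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs  # (seq, half)
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Pre-normalization: normalize the *input* of each sub-layer, e.g.
#   h = x + attention(norm1(x));  out = h + feed_forward(norm2(h))
```

In the actual architecture, the rotary embedding is applied to the query and key vectors inside every attention layer, and the two RMSNorms sit in front of the attention and feed-forward sub-layers, as sketched in the final comment.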
Cost of pre-training
The authors mention that they used A100-80GB GPUs for pre-training.
- LLaMA-7B, 13B, 33B, and 65B used roughly 82K, 135K, 530K, and 1022K GPU-hours respectively.
- At the time of writing, the lowest on-demand A100 price is about \$1.10 per GPU-hour (Lambda Labs). At that rate, the 7B model would cost roughly \$90K to pre-train and the 65B roughly \$1.1M (see the quick calculation below).
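A quick back-of-the-envelope for all four sizes, assuming the \$1.10 per GPU-hour figure above (GPU-hour counts are rounded):

```python
# GPU-hours from the paper (rounded) times an assumed $1.10 per A100 GPU-hour.
gpu_hours = {"7B": 82_000, "13B": 135_000, "33B": 530_000, "65B": 1_022_000}
price_per_gpu_hour = 1.10  # USD, on-demand A100 rate quoted above (assumption)

for model, hours in gpu_hours.items():
    print(f"LLaMA-{model}: ~${hours * price_per_gpu_hour:,.0f}")
# LLaMA-7B: ~$90,200
# LLaMA-13B: ~$148,500
# LLaMA-33B: ~$583,000
# LLaMA-65B: ~$1,124,200
```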
Written on July 30, 2023