LLaMA-1

The LLaMA language model was introduced in the paper “LLaMA: Open and Efficient Foundation Language Models”, published by Meta in February 2023.

Approach

The authors trained four model sizes, as shown in the table below:

| params | dimension | $n$ heads | $n$ layers | learning rate | batch size | $n$ tokens |
|--------|-----------|-----------|------------|----------------------|------------|------------|
| 6.7B   | 4096      | 32        | 32         | $3.0 \times 10^{-4}$ | 4M         | 1.0T       |
| 13.0B  | 5120      | 40        | 40         | $3.0 \times 10^{-4}$ | 4M         | 1.0T       |
| 32.5B  | 6656      | 52        | 60         | $1.5 \times 10^{-4}$ | 4M         | 1.4T       |
| 65.2B  | 8192      | 64        | 80         | $1.5 \times 10^{-4}$ | 4M         | 1.4T       |
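
The parameter counts in the first column can be roughly reproduced from the dimension and layer counts. The sketch below is a back-of-the-envelope check, not the authors' code: the feed-forward hidden size rule ($\tfrac{2}{3} \cdot 4d$, rounded up to a multiple of 256) and the 32K-token vocabulary come from the released LLaMA code and tokenizer, and small terms such as the RMSNorm weights are ignored.

```python
VOCAB_SIZE = 32_000  # LLaMA's BPE vocabulary size

def approx_params(dim: int, n_layers: int, multiple_of: int = 256) -> float:
    """Approximate parameter count (in billions) for a LLaMA-style transformer."""
    # Feed-forward hidden size: 2/3 * 4 * dim, rounded up to a multiple of 256.
    hidden = int(2 * 4 * dim / 3)
    hidden = multiple_of * ((hidden + multiple_of - 1) // multiple_of)
    attention = 4 * dim * dim          # Wq, Wk, Wv, Wo projections
    feed_forward = 3 * dim * hidden    # SwiGLU uses three weight matrices
    embeddings = 2 * VOCAB_SIZE * dim  # input embedding + output projection
    return (n_layers * (attention + feed_forward) + embeddings) / 1e9

for dim, n_layers in [(4096, 32), (5120, 40), (6656, 60), (8192, 80)]:
    print(f"dim={dim}, layers={n_layers}: ~{approx_params(dim, n_layers):.1f}B")
# ~6.7B, ~13.0B, ~32.5B, ~65.3B -- close to the "params" column above.
```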

Pre-training Dataset

The full training dataset contains ~1.4T tokens after subword tokenization. Except for the Wikipedia and Books datasets, which are seen approximately twice, each token is used only once during training. The data mixture used for pre-training is shown below.

| Dataset       | Sampling proportion | Epochs | Disk size |
|---------------|---------------------|--------|-----------|
| CommonCrawl   | 67.0%               | 1.10   | 3.3 TB    |
| C4            | 15.0%               | 1.06   | 783 GB    |
| Github        | 4.5%                | 0.64   | 328 GB    |
| Wikipedia     | 4.5%                | 2.45   | 83 GB     |
| Books         | 4.5%                | 2.23   | 85 GB     |
| ArXiv         | 2.5%                | 1.06   | 92 GB     |
| StackExchange | 2.0%                | 1.03   | 78 GB     |
  • English CommonCrawl: five CommonCrawl dumps from 2017 to 2020, preprocessed with the CCNet pipeline, which deduplicates the data at the line level, performs language identification with a fastText classifier to remove non-English pages, and filters low-quality content with an n-gram language model.
  • C4: this was the dataset used to train T5.
  • Github: the public GitHub dataset available on Google BigQuery.
  • Wikipedia: dumps from the June-August 2022 period, covering 20 languages. Hyperlinks, comments, and other formatting boilerplate are removed.
  • Gutenberg and Books3: the Gutenberg Project which contains books in the public domain, and the Books3 section of ThePile, a publicly available dataset for training large language models.
  • ArXiv: arXiv LaTeX files, processed to add scientific data to the dataset.
  • Stack Exchange: a dump of Stack Exchange, keeping only the data from the 28 largest websites.
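
The “Epochs” column follows from the proportions and dataset sizes: sampling batches according to the table over the full ~1.4T-token run means small but heavily weighted sources such as Wikipedia and Books are seen more than once, while Github is only partially covered. Below is a minimal sketch of such proportional sampling; the source names come from the table, but the sampling routine is an illustrative assumption rather than the authors' data pipeline.

```python
import random

# Sampling proportions from the mixture table above.
MIXTURE = {
    "CommonCrawl":   0.670,
    "C4":            0.150,
    "Github":        0.045,
    "Wikipedia":     0.045,
    "Books":         0.045,
    "ArXiv":         0.025,
    "StackExchange": 0.020,
}

def sample_source(rng: random.Random) -> str:
    """Pick the dataset the next training document is drawn from."""
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in MIXTURE}
for _ in range(100_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly 67% CommonCrawl, 15% C4, ..., matching the table
```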

Architecture

The LLaMA authors leveraged several transformer improvements proposed in prior work; a minimal code sketch of these components follows the list:

  • Pre-normalization: inspired by GPT-3. The input of each transformer sub-layer is normalized with RMSNorm, rather than normalizing the output.
  • SwiGLU activation function: inspired by PaLM. Instead of using the usual ReLU activation function, LLaMA uses the SwiGLU activation in the feed-forward blocks.
  • Rotary embeddings: inspired by GPTNeo. Instead of using the usual absolute positional embeddings, LLaMA applies rotary positional embeddings (RoPE) at each layer of the network.
  • Checkpointing: to reduce the amount of activations recomputed during the backward pass, the activations that are expensive to compute (such as the outputs of the linear layers) are saved by manually implementing the backward function, trading extra memory for less recomputation.
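
The first three changes are easy to express in code. The following is a minimal, self-contained PyTorch sketch of RMSNorm, the SwiGLU feed-forward block, and rotary position embeddings; tensor layouts, module names, and the interleaved rotation are my own simplifications, not the official LLaMA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Pre-normalization: scale by the root mean square, with no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with SwiGLU: silu(x W1) * (x W3), projected back by W2."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (batch, seq, heads, head_dim)."""
    _, seq_len, _, head_dim = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq, head_dim/2)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Rotate each (x1, x2) pair by a position- and frequency-dependent angle.
    return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)

# Inside attention, the rotation is applied to queries and keys at every layer:
#   q, k = rotary_embedding(q), rotary_embedding(k)
# and each sub-layer input is pre-normalized before attention / feed-forward.
```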

Cost of pre-training

The authors mention that they used 2048 A100 GPUs with 80 GB of RAM for pre-training.

  • The LLaMA 7B, 13B, 33B, and 65B models used approximately 82K, 135K, 530K, and 1,022K GPU-hours, respectively.
  • The current lowest-cost on-demand pricing for the A100 is from Lambda Labs, at \$1.10 per GPU-hour. At that rate, the 7B model would cost roughly \$90K to pre-train and the 65B roughly \$1.1M, as the quick calculation below shows.
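
A quick back-of-the-envelope calculation using those GPU-hour figures and the \$1.10 rate quoted above:

```python
# Back-of-the-envelope pre-training cost, assuming $1.10 per A100 GPU-hour.
GPU_HOURS = {"7B": 82_000, "13B": 135_000, "33B": 530_000, "65B": 1_022_000}
PRICE_PER_GPU_HOUR = 1.10

for model, hours in GPU_HOURS.items():
    print(f"LLaMA-{model}: ~${hours * PRICE_PER_GPU_HOUR:,.0f}")
# LLaMA-7B:  ~$90,200
# LLaMA-65B: ~$1,124,200
```
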
Written on July 30, 2023