Zephyr

Two weeks after the release of Mistral-7B, researchers at HuggingFace released “Zephyr: direct distillation of LM alignment” on 10/25/2023. At the time of its release, it was the most capable 7B language model. The Zephyr authors started with the Mistral-7B model, performed instruction fine-tuning with the UltraChat dataset, and then performed preference optimization using DPO on the UltraFeedback dataset.

Mistral 7B

On 10/10/2023, researchers from the newly formed Mistral.ai introduced a paper titled “Mistral 7B”, which described a 7B language model.

The model leverages grouped-query attention and sliding window attention for improved inference speed and reduced memory consumption.
The authors evaluated on a wide variety of tasks that can be categorized as: commonsense reasoning, world knowledge, reading comprehension, math, code, and popular aggregated results.
A figure in the paper shows that Mistral 7B outperforms Llama-2-7B and Llama-2-13B.

Fine-Tuning Language Models for Factuality

The paper “Fine-tuning language models for factuality”, from 11/14/2023, shows that it is possible to fine-tune language models to generate more factual text. In particular, the authors generated their own factuality datasets and used the recently introduced direct preference optimization (DPO) method to fine-tune LLMs.

Direct Preference Optimization

Existing methods typically steer LMs to match human preferences using reinforcement learning from human feedback (RLHF). RLHF (i) fits a reward model $r$ to a dataset of human preferences, then (ii) uses RL to optimize a language model policy to produce responses that would be assigned high reward by $r$, while not drifting excessively far from the original model. The paper “Direct preference optimization: your language model is secretly a reward model” from 5/29/2023 eliminates the need to fit a separate reward model and directly fine-tunes LMs to align with human preferences.
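
To make the contrast with RLHF concrete, here is a minimal sketch of the DPO loss for a single preference pair. It assumes we already have the summed log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under both the policy being trained and the frozen reference model. The function name, argument names, and the default `beta` are illustrative choices, not taken from the paper's code.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # Implicit reward of a response: beta * log(pi(y|x) / pi_ref(y|x)).
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # Loss is -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the policy raises the probability of the chosen response relative to the reference model more than it raises that of the rejected one; no separate reward model appears anywhere in the computation.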

RA-DIT - Retrieval-Augmented Dual Instruction Tuning

Recent retrieval augmented generation (RAG) models tend to perform joint training of the retriever and generator. In contrast, the paper “RA-DIT: retrieval-augmented dual instruction tuning” by Meta, from 10/2/2023, proposed a lightweight approach. Here, the authors performed two distinct fine-tuning steps: (1) update a pretrained LLM to use retrieved information, (2) update the retriever to return results as preferred by the LLM. That is, the RA-DIT approach fine-tunes the LLM and the retriever separately.

RAG-end2end for Domain Specific QA

When the original RAG model was introduced, the passage encoding and indexing were kept fixed, since re-encoding the external knowledge base passages during training is expensive. Despite this, the original RAG model performed well on Wikipedia-based evaluation sets, since the dense passage retriever (DPR) used there had been trained on Wikipedia-based datasets. In contrast, the paper “Improving the domain adaptation of retrieval augmented generation (RAG) models for open domain question answering” from 10/6/2022 explores using RAG for domain-specific QA.

Code Example on Instruction Fine-tuning of llama2-7B using LoRA

In earlier articles we discussed instruction fine-tuning, LoRA and quantization. We now tie these concepts together and show example code that performs instruction fine-tuning of llama2-7B using LoRA. This was done on an A5000 GPU with 24GB of RAM.

DeBERTa-v3

In Mar-2023, researchers from Microsoft proposed combining DeBERTa and ELECTRA, resulting in DeBERTa-v3, and described their approach in the paper “DeBERTa-v3: Improving DeBERTa using ELECTRA-style pre-training with gradient disentangled embedding sharing”. This paper modified the replaced token detection (RTD) objective of ELECTRA, and combined it with the disentangled attention approach of DeBERTa. The authors showed that DeBERTa-v3 performs better than BERT, DeBERTa, RoBERTa, and XLM-R.

DeBERTa

In Oct-2021, researchers from Microsoft introduced the DeBERTa encoder model in “DeBERTa: decoding-enhanced BERT with disentangled attention”, which performs better than BERT and RoBERTa. The main contribution of DeBERTa is in introducing new/separate embeddings for relative positions. This is in contrast with the usual encoder transformers where position information is additive to the word/content embeddings at input time. Specifically, DeBERTa:

Keeps the content embeddings separate from the relative position embeddings. DeBERTa introduces new relative-position projection matrices $W$ for the query and key.
When calculating self-attention, besides the content-to-content dot-product for the self-attention score, DeBERTa also includes content-to-position and position-to-content terms.
Due to the above, position information is supplied to each transformer layer. Contrast this with the usual transformer, where the position information is added to the input/content embeddings only at the very beginning.

Retrieval-Augmented Generation (RAG)

Although language models are becoming more capable, providing provenance for their outputs and updating their knowledge are still problematic. Hence, researchers introduced the retrieval-augmented generation (RAG) approach in the paper “Retrieval-augmented generation for knowledge-intensive NLP tasks” in April-2021, as a means to introduce and update the parametric knowledge of pre-trained language models.

Dense Passage Retrieval (DPR)

To retrieve relevant passages for answering questions, traditional methods rely on sparse vector space methods based on TF-IDF or BM25. In the paper “Dense passage retrieval for open-domain question answering” published in 2020, researchers showed that using the BERT transformer to encode both the questions and the passages, then fine-tuning the encoder weights to maximize the dot-product similarity between positive question-passage pairs, results in a “dense passage retrieval” model that significantly outperforms BM25.

GPTQ

A quantization method that has been gaining popularity is GPTQ, which does post-training quantization of language models. GPTQ was introduced in the paper “GPTQ: accurate post-training quantization for generative pre-trained transformers” in Mar-2023. The name GPTQ stands for Generative Post-Training Quantization.

LLaMA-2

The LLaMA-2 model was introduced in the paper “LLaMA-2: open foundation and fine-tuned chat models” by Meta in Jul-2023. Similar to LLaMA-1, the LLaMA-2 model applies pre-normalization using RMSNorm, and uses the SwiGLU activation function and rotary positional embeddings. However, LLaMA-2 differs from LLaMA-1 in the following aspects:

LLaMA-1 was trained on up to 1.4T tokens and has a context length of 2k. LLaMA-2 was trained on 2T tokens and has a context length of 4k.
Grouped-query attention: A standard practice for autoregressive decoding is to cache the key (K) and value (V) pairs once they are computed for the previous tokens in the sequence, speeding up attention computations. However, caching the KV pairs requires extra memory. Also, once the computations are sped up, reading and writing to and from the GPU memory becomes the bottleneck. Hence, LLaMA-2 leveraged the grouped-query attention (GQA) approach for its larger models, in which a group of query heads shares a single key/value head; multi-query attention (MQA) is the special case with only one key/value head.
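
As a rough illustration of why sharing key/value heads saves memory, the sketch below maps each query head to the key/value head it reads from and counts the number of cached values. Grouped-query attention lets several query heads share one key/value head, and multi-query attention is the special case with a single key/value head. The function names and the toy sizes are assumptions made for illustration only, not LLaMA-2's actual configuration.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Return the index of the key/value head shared by this query head."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


def kv_cache_entries(seq_len, num_layers, num_kv_heads, head_dim):
    """Number of cached scalars: keys and values for every layer, head, and token."""
    return 2 * seq_len * num_layers * num_kv_heads * head_dim


# Toy setting with 8 query heads:
# num_kv_heads=8 is standard multi-head attention,
# num_kv_heads=2 is grouped-query attention,
# num_kv_heads=1 is multi-query attention.
for num_kv_heads in (8, 2, 1):
    mapping = [kv_head_for_query_head(h, 8, num_kv_heads) for h in range(8)]
    print(num_kv_heads, mapping, kv_cache_entries(1024, 4, num_kv_heads, 64))
```

The cache size scales linearly with the number of key/value heads, which is where the memory and bandwidth savings come from.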

LLaMA-1

The LLaMA language model was introduced in the paper “LLaMA: open and efficient foundation language models” by Meta in Feb-2023.

Multi-query and Grouped Multi-query Attention

Multi-query Attention

Key-Value Caching

At each time-step of a generative model, we just want to calculate the attention scores for the new token. To avoid re-calculating attention scores associated with previous (already generated) tokens, we apply Key-Value (KV) caching.
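
The idea can be sketched with a toy single-head attention in plain Python. For brevity, this sketch assumes the query, key, and value vectors for each token are given directly (the learned projections are omitted), and the class and function names are hypothetical.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention of one query over a list of keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    max_score = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


class KVCache:
    """Stores the keys and values of already generated tokens."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Only the new token's key and value are appended; earlier ones are
        # reused from the cache instead of being recomputed.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)
```

At each generation step, only the new token's query is attended against the cached keys and values, so the per-step cost grows with the sequence length rather than with its square.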

Rotary Position Embedding

The rotary position embedding method was introduced in the paper “RoFormer: Enhanced Transformer with Rotary Position Embedding” in April-2021.

Gradient Check-Pointing

Gradient check-pointing is one of the techniques we can use to reduce the memory footprint when training transformer models. To compute the forward and backward passes of a computation graph, the usual strategy is to compute values “as soon as possible” and keep them. For instance, in the following figure which represents a computation graph, the forward pass (top row) activations are computed once and then stored in memory, so that they are available as inputs to the backward pass (bottom row) computations. However, storing all activations is memory intensive.

Quantization (16-bit, 8-bit, 4-bit) and QLoRA

Quantization is another technique to reduce the memory footprint of transformer models. In this article, we first discuss how real numbers are represented in computers as a binary sequence and the memory requirements of transformer models. Next, we describe quantization using 16-bit, 8-bit, and finally 4-bit using QLoRA.

LoRA

LoRA was introduced in the paper “LoRA: Low-Rank Adaptation of Large Language Models” by researchers from Microsoft in June-2021.

Parameter Efficient Fine Tuning (PEFT)

Assume that you want to fine-tune a large pretrained model for a number of different tasks. The traditional options are:

Fine-tune the pretrained model on each task separately. But you will then be storing a separate copy of each fine-tuned model for each task.
Assuming that the various tasks are not entirely different, then you can attempt to sort them in some linear order (e.g. in terms of increasing difficulty): fine-tune the model on task-1, then task-2, then task-3. But this runs the risk of catastrophic forgetting on earlier tasks.

Linear Algebra

Here we review the basics of linear algebra, which should help in understanding techniques such as Low Rank Adaptation (LoRA).

Tk-INSTRUCT - Encoder-Decoder fine-tuned on 1600+ NLP tasks

The Tk-INSTRUCT encoder-decoder language model is based on T5 11B, and is fine-tuned on a large dataset of 1,616 diverse NLP tasks with written instructions. It is described in the paper “Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks”, published in April 2022. In the paper, the authors also built a multi-lingual version of their model, mTk-INSTRUCT, and showed that their proposed models outperform InstructGPT on their dataset.

U-PaLM - UL2 Mixture of Denoisers for PaLM Decoder

The authors continued training the decoder-only PaLM model with the mixture of denoising UL2 pretraining objective, giving a U-PaLM model. U-PaLM was described in the paper “Transcending Scaling Laws with 0.1% Extra Compute” from Google, published in October 2022.

UL2 - Mixture of Denoisers for Pretraining

The UL2 encoder-decoder model, with a mixture of denoising objectives, was introduced in the paper “UL2: Unifying Language Learning Paradigms” from Google, published in May 2022. The aim of UL2 is to pretrain a model so that it works better than the autoregressive GPT-like models and the encoder-decoder T5 models, across many different NLP tasks. In evaluations, UL2 20B performs better than GPT-3 175B and T5 11B, but still lags behind the large PaLM 540B model.

Holistic Evaluation of Language Models

The paper “Holistic Evaluation of Language Models” from Stanford, published in November 2022, is a large scale evaluation of 30 language models over a set of 16 scenarios and 7 categories of metrics.

Model Quantization

To reduce inference runtime, we can also perform quantization, which converts 32-bit floating-point numbers to 8-bit integers. This makes inference computation more efficient and reduces memory consumption. When quantizing the weights of deep neural models, we map the (relatively narrow) range of floating-point values onto a range of integers, clamp any outliers, and then round to whole numbers.
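
The map, clamp, and round steps can be sketched as follows. This is one simple scheme (symmetric "absmax" scaling to the range [-127, 127]), chosen here for illustration; the function names and the optional `clip` parameter are assumptions, and real libraries use more elaborate schemes.

```python
def quantize_int8(weights, clip=None):
    """Symmetric quantization of floats to integers in [-127, 127].

    `clip` optionally sets the largest magnitude to represent; values
    beyond it are treated as outliers and clamped.
    """
    max_abs = clip if clip is not None else max(abs(w) for w in weights)
    if max_abs == 0:
        return [0] * len(weights), 1.0
    scale = max_abs / 127.0  # width of one integer step in float units
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate floating-point values."""
    return [q * scale for q in quantized]
```

Dequantizing recovers each in-range weight to within half a quantization step (`scale / 2`); clamped outliers lose more.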

Knowledge Distillation

Knowledge distillation is a general purpose method for training a smaller student model to mimic the behaviour of a slower, larger, but better performing teacher. It was popularized in a 2015 paper (Distilling the knowledge in a neural network. G. Hinton et al. 2015) that generalized the method to deep neural networks. The main idea is to augment the ground truth labels with a distribution of “soft probabilities” from the teacher which provides complementary information for the student to learn from. We describe knowledge distillation and DistilBERT in this article.
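
A minimal sketch of a distillation loss for one training example is shown below. It combines the usual cross-entropy against the ground-truth label with a KL term that pulls the student's temperature-softened distribution towards the teacher's "soft probabilities". The weighting `alpha`, the default temperature, and the function names are illustrative assumptions; the multiplication of the soft term by the squared temperature follows the suggestion in Hinton et al.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperatures give softer distributions."""
    scaled = [z / temperature for z in logits]
    max_z = max(scaled)
    exps = [math.exp(z - max_z) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of the hard-label loss and the soft-target loss."""
    # Hard loss: cross-entropy of the student against the ground-truth label.
    hard = -math.log(softmax(student_logits)[label])
    # Soft loss: KL(teacher || student) on temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(pt * math.log(pt / ps)
               for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft
```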

MTEB Benchmark Dataset to Evaluate Sentence Embedding Models

MTEB primarily aims to evaluate models’ ability to embed sentences or paragraphs; the paper benchmarks 33 models. MTEB includes 8 different tasks over 56 datasets (of which 10 are multilingual), covering 112 different languages. Both sentence-level and paragraph-level data are included. MTEB was introduced in the paper “MTEB: Massive Text Embedding Benchmark”, published in October 2022.

SGPT - GPT Sentence Embeddings

The SGPT paper fine-tunes GPT-style decoder-only models on pairwise sentence datasets, such that they can produce effective sentence embeddings. Related work includes Sentence-BERT and the all-mpnet-base-v2 model, which are sentence embedding models based on encoder Transformers. The SGPT paper “SGPT: GPT Sentence Embeddings for Semantic Search” was published in February 2022, and leveraged open source GPT-style models: GPT-neo and GPT-J from EleutherAI.

Sentence Embeddings Using Siamese Networks and all-mpnet-base-v2

A sentence embedding is a single vector that captures the semantic meaning of a piece of text, usually a single sentence or a paragraph. To derive an effective sentence embedding, the ACL-2019 paper “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks” proposed leveraging a Transformer-based Siamese network.

MPNet - Masked and Permuted Language Modeling

A new pretraining method, Masked and Permuted Language Modeling (MPNet), was introduced in the paper “MPNet: Masked and Permuted Pre-training for Language Understanding”, published in April 2020. It was meant to combine masked language modeling (MLM) and Permutation Language Modeling (PLM) while addressing the deficiencies of each. Experiments show that it outperforms MLM and PLM, and achieves SOTA performance on various NLP datasets such as GLUE, SQuAD, etc.

Permutation Language Modeling

The Permutation Language Modeling (PLM) pretraining objective was introduced in the paper “XLNet: Generalized Autoregressive Pretraining for Language Understanding”, published in June 2019. It tries to address the independence and noise issues that arise from the masked language modeling (MLM) pretraining objective of BERT, while retaining its advantage of utilizing bidirectional context. The core idea is to sample many permutations of the same input sequence, and train on these in an autoregressive manner. In expectation, each token will then have learnt from all other tokens in the input context, while avoiding the usage of masking.

SWITCH - Sparsely Activated Encoder-Decoder Language Model

The SWITCH model was described in the paper “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity”, published in January 2021. It is a sparsely activated expert model, i.e. activating a subset of the NN weights for each incoming example. The authors claimed this simplifies and improves over the Mixture of Experts (MoE) architecture.

GLaM - MoE Decoder Language Model

The GLaM model (Generalist Language Models) was described in the paper “GLaM: Efficient Scaling of Language Models with Mixture-of-Experts”, published in December 2021. It is a decoder-only language model that does conditional computation using mixture of experts (MoE).

Chinchilla - A Compute Optimal Decoder Language Model

The Chinchilla model was described in “Training Compute-Optimal Large Language Models” published in March 2022. The paper introduced Chinchilla, an auto-regressive language model (70B model trained on 1.4T tokens) which performed better than GPT-3 (175B trained on 300B tokens) and Gopher (their prior work, an auto-regressive 280B model trained on 300B tokens).

ELECTRA Encoder Language Model

The ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) model was described in “ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators” published in March 2020.

FLAN-PaLM - Fine-Tuned Decoder Language Model

The Flan-PaLM model was described in the paper “Scaling Instruction-Finetuned Language Models” published in October 2022.

PaLM - Decoder Language Model

The PaLM (Pathways Language Model) model was introduced in the paper “PaLM: Scaling Language Modeling with Pathways” published in April 2022. This is a 540B parameter autoregressive decoder language model, trained on 780B text tokens.

BART Encoder-Decoder Language Model

BART is a language model from Meta, described in the paper “BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension”, published in October 2019. It is most similar to the T5 model, which is also an encoder-decoder Transformer.

T0 - Fine-Tuned Encoder-Decoder Language Model

The T0 model was introduced in the paper “Multitask Prompted Training Enables Zero-Shot Task Generalization” published in October 2021. The setup of T0 is very similar to the FLAN model from Google (published one month earlier in September 2021). The main difference is that T0 is based on the T5 model (encoder-decoder), while FLAN is based on LaMDA-PT (Google’s decoder-only language model).

FLAN - Fine-Tuned Decoder Language Model

The FLAN model (Finetuned Language Net) from Google was introduced in the paper “Finetuned Language Models Are Zero-Shot Learners” published in September 2021. The paper shows that performing multitask fine-tuning improves zero-shot generalization to new tasks (i.e. tasks not included in fine-tuning).

mT5 Multilingual Encoder-Decoder Language Model

The mT5 language model was introduced in the paper “mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer” published in October 2020. This is a multilingual version of the T5 model. Their largest model (13B XXL) exceeds SOTA in all classification and QA tasks, and near SOTA for NER. In general, mT5 is relatively weak on NER, requiring usage of the mT5-XL (3.7B) model to exceed XLM-R (550M parameters) on NER.

T5 Encoder-Decoder Language Model

The T5 (Text-to-Text Transfer Transformer) model from Google was introduced in the paper “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”, published in October 2019.

Big Bird Transformer for Longer Sequences

The self-attention in Transformers allows every token to attend independently to every other token. However, full self-attention requires computation that is quadratic in the sequence length. The Big Bird Transformer was proposed in the paper “Big Bird: Transformers for Longer Sequences”, published by Google in July 2020. Big Bird is a sparse attention mechanism that reduces the computation from quadratic to linear, and can handle sequence lengths up to 8x of what was previously possible, while using similar hardware.

REALM - Augment Language Models with a Knowledge Retriever

The Retrieval-Augmented Language Model (REALM) is described in the paper “REALM: Retrieval-Augmented Language Model Pre-Training”, published by Google in February 2020. REALM augments language models with a knowledge retriever, such that during pretraining, fine-tuning, and inference, the language model is able to retrieve and attend over text documents from an external corpus. This addresses two problems: (i) without such an external knowledge source, the parameters of the language model are the sole source of all learned knowledge, thus requiring larger and larger models to store increasingly more knowledge, (ii) without an external knowledge source, the trained language model is inherently static.

TransferTransfo Dialog Model

The TransferTransfo generative model is a dialog system (chatbot) from Huggingface, described in the paper “TransferTransfo: A Transfer Learning Approach for Neural Network Based Conversational Agents”, published in 2019.

LaMDA Decoder Dialog Model

Google’s LaMDA (Language Models for Dialog Applications) model is a decoder-based Transformer dialog model that is designed to produce dialog responses that are high quality, safe, and grounded. It was introduced in the paper “LaMDA: Language Models for Dialog Applications” published in January 2022.

How was ChatGPT Trained?

ChatGPT was built on top of the InstructGPT paper “Training language models to follow instructions with human feedback” published in March 2022.

Self-Consistency Inference

The self-consistency (CT) inference strategy was introduced in the paper “Self-Consistency Improves Chain of Thought Reasoning in Language Models”, published in March 2022. In essence, the CT approach is simply to perform multiple inferences, and use the most frequent answer.
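
The majority-vote step can be sketched in a few lines. The sketch assumes the final answers have already been extracted from several sampled reasoning chains; how they are sampled and parsed is outside its scope, and the function name is a hypothetical one.

```python
from collections import Counter


def self_consistent_answer(sampled_answers):
    """Return the most frequent answer and the fraction of samples that agree."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)


# Final answers parsed from five independently sampled generations (made-up data).
print(self_consistent_answer(["18", "18", "26", "18", "17"]))
```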

Chain-of-Thought (CoT) Prompting

Chain-of-Thought is a prompting strategy introduced in the paper “Chain of thought prompting elicits reasoning in large language models”, published in January 2022. CoT prompting improves few-shot performance of various language models and was used in the FLAN-PaLM model.

GPT-3 Decoder Language Model

The GPT-3 language model was published in the paper “Language Models are Few-Shot Learners” in July 2020. GPT-3 is a 175 billion parameter pre-trained language model. In the GPT-3 paper, the authors did not perform fine-tuning (in any case, it would be very expensive to fine-tune such a large model). Instead, the authors experimented with zero-shot, one-shot, and few-shot (few-shot is usually 10-100 examples) prompting. In all these experimental settings, there is no fine-tuning of model weights. Instead, the task description is simply provided in the input prompt to the model.

GPT-2 Decoder Language Model

The GPT-2 language model was published in the paper “Language Models are Unsupervised Multitask Learners” in February 2019. The GPT-2 paper has 2 main differences from the GPT-1 paper. First, GPT-2 experimented with various model sizes, ranging from 117M parameters (same size as GPT-1) to 1.5B parameters. Second, instead of performing fine-tuning on downstream NLP tasks as was done in GPT-1, the GPT-2 paper focused on zero-shot evaluation.

GPT-1 Decoder Language Model

The GPT-1 language model was introduced in the paper “Improving Language Understanding by Generative Pre-Training” in June 2018. The major contributions of GPT-1 are the following:

Prior to GPT-1, it wasn’t clear or demonstrated that pre-training on Transformers would enable effective transfer learning to downstream tasks. The GPT-1 paper demonstrated that this approach of Transformer pre-training and fine-tuning works to produce SOTA results on various NLP tasks.
They also showed that for fine-tuning downstream NLP tasks, instead of building task-specific model architectures, it is possible to perform task-specific input transformations and then generically stack on a linear classification layer on top of the pre-trained transformer.

Transformer Architecture Explained

In 2018, two Transformer models were released that combined self-attention with transfer learning:

GPT: “Improving language understanding by generative pre-training” (Radford et al. 2018). Uses the decoder part of the Transformer to predict words in an autoregressive manner.
BERT: “BERT: pre-training of deep bidirectional transformers for language understanding” (Devlin et al. 2018). Uses the encoder part of the Transformer and performs masked language modeling (MLM).

These models opened the floodgates for using Transformers in NLP and propelled the introduction of subsequent language models. This article describes the Transformer architecture.

Techniques to Enable Deep Neural Networks

To train deep neural networks, we require techniques to stabilize training and reduce problems such as vanishing gradients. In this article, we discuss Skip Connections and Layer Normalization.

SuperGLUE Benchmark Dataset

The authors noted that system performance on their previously introduced GLUE benchmark dataset has surpassed the level of non-expert humans. Thus, they introduced SuperGLUE, a new benchmark dataset, in the paper “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems”, published in 2019.

BEIR Dataset for Zero-shot Evaluation of IR Models

The BEIR information retrieval (IR) dataset was introduced in the paper “BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models”, published in 2021. The paper puts together 18 publicly available datasets to evaluate 10 IR models. The task is: given a query, retrieve the relevant passages/documents as a ranked list. Models are evaluated using nDCG@10.

ColBERTv2 - Efficient Passage Retrieval

The ColBERTv2 neural search model is a follow-up to their 2020 ColBERT work, where the architecture remains similar, but leverages compression techniques to achieve 6-10x storage reduction as compared to ColBERT. It was introduced in the paper “ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction”, published in 2022.

ColBERT - Passage Search via Contextualized Late Interaction over BERT

ColBERT is a neural passage search model that was introduced in the paper “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT”, published in 2020.

Relative Position Embedding

The idea of learning relative position embeddings was introduced in the paper “Self-Attention with Relative Position Representations” by Google, published in 2018.

XLM-R

XLM-R is a multilingual encoder-based language model described in the paper “Unsupervised Cross-lingual Representation Learning at Scale”, published by Facebook in November 2019. The model is based on the multilingual XLM, but inspired by techniques from the RoBERTa model. XLM-R handles 100 languages, was pretrained on 2.5TB of filtered CommonCrawl data, and significantly outperforms multilingual BERT (mBERT) on various multilingual evaluation datasets.

XLM

The XLM cross-lingual language model was described in the paper “Cross-lingual Language Model Pretraining”, published by Facebook in January 2019.

RoBERTa

The RoBERTa encoder model was described in the paper “RoBERTa: A Robustly Optimized BERT Pretraining Approach” published in July 2019. It is a collaborative effort between University of Washington and Facebook. RoBERTa is essentially a replication of BERT, but with (i) longer training, bigger batch size, and pretraining over more data, (ii) the Next Sentence Prediction (NSP) objective removed, (iii) training on longer sequences, (iv) dynamic masking. The authors show that RoBERTa significantly outperforms BERT on the GLUE, SQuAD, and RACE datasets.

BERT

BERT uses just the encoder stack of the Transformer. It was described in the paper “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, published in October 2018. It showed that its bidirectional masked language modeling (MLM) pretraining objective allows for better downstream fine-tuning as compared to the autoregressive GPT-1.

Subword Tokenization

In this article, we describe Byte pair encoding (BPE), WordPiece tokenization, and Unigram tokenization. One of the first steps in processing a piece of text is to tokenize it, i.e. split it into “chunks”. One simple method is to split on spaces, but (i) some languages like Chinese and Japanese are written without spaces between words, (ii) merely splitting on spaces will result in a very large vocabulary size, forcing models to have very large embedding matrices. On the other end of the spectrum, we could split into individual characters. However, it is hard to learn meaningful representations on individual characters. Transformer models use subword tokenization, which splits single words into one or more subwords.

Unicode and UTF-8

Computers work directly with bits. We need an encoding scheme to convert between bit strings and human readable characters. In ASCII encoding, 7-bit codes (usually stored in 8-bit bytes) are mapped to a set of 128 characters. But this is insufficient to represent all characters of all languages. Thus the need for Unicode and encoding schemes such as UTF-8.
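
A short sketch using Python's built-in `ord` and `str.encode` shows how characters map to Unicode code points and then to a variable number of UTF-8 bytes. The helper name `describe` is made up for this example.

```python
def describe(char):
    """Return (Unicode code point, number of UTF-8 bytes, the bytes as bit strings)."""
    code_point = ord(char)
    encoded = char.encode("utf-8")
    bits = " ".join(f"{byte:08b}" for byte in encoded)
    return code_point, len(encoded), bits


for char in ["A", "é", "中", "😀"]:
    print(repr(char), describe(char))
```

ASCII characters such as "A" take a single byte in UTF-8, while characters outside ASCII take two to four bytes.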

Likelihood based Generative Models

When training language models, we optimize for the likelihood of the underlying training corpus. In this article, we describe this likelihood function.

Perplexity

Perplexity is commonly used to quantify the quality of a language model.
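
As a small illustration, perplexity can be computed as the exponential of the average negative log-likelihood per token. The sketch assumes we are given the probability the model assigned to each observed token; the function name is illustrative.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability of the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)
```

A model that assigns probability 1/4 to every observed token has a perplexity of 4, i.e. it is as uncertain as a uniform choice among four options; lower is better.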

Entropy and Cross-Entropy

Cross-entropy is often used in machine learning as a loss function. We describe some technical foundations of entropy and cross-entropy in this article.
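
For discrete distributions, the two quantities can be sketched as below (in bits, i.e. using log base 2). Here `p` plays the role of the true distribution and `q` the model's predicted distribution; the function names are illustrative.

```python
import math


def entropy(p):
    """H(p) = -sum_i p_i * log2(p_i), skipping zero-probability outcomes."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log2(q_i), the cost of coding p using q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)
```

The cross-entropy is never smaller than the entropy, and the gap between them is the KL divergence from `p` to `q`. When `p` is a one-hot label, the cross-entropy reduces to the negative log-probability the model assigns to the correct class.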

Information Retrieval Evaluation Metrics

We define two common information retrieval (IR) metrics: MAP@K and NDCG.
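
Both metrics can be sketched in plain Python. Conventions differ between papers and libraries, so two assumptions are made here: AP@K is normalized by the number of relevant items found in the top K, and NDCG uses linear gains with the ideal ranking taken over the same retrieved list. The function names are illustrative.

```python
import math


def average_precision_at_k(ranked_relevance, k):
    """AP@K for one query; ranked_relevance holds 0/1 flags in rank order."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance[:k], start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def map_at_k(relevance_per_query, k):
    """MAP@K: the mean of AP@K over all queries."""
    scores = [average_precision_at_k(r, k) for r in relevance_per_query]
    return sum(scores) / len(scores)


def dcg_at_k(gains, k):
    """Discounted cumulative gain: gains are discounted by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains, k):
    """DCG@K divided by the DCG@K of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```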

Optimizers

An optimizer updates network parameters as training iterations proceed. Here we describe Stochastic Gradient Descent (SGD), SGD with momentum, RMSProp, and Adam.
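
The update rules can be sketched for a single scalar parameter as follows. These are the commonly used textbook forms; the default hyperparameters and function names are illustrative assumptions, and real frameworks apply the same rules element-wise to tensors.

```python
import math


def sgd_step(param, grad, lr=0.1):
    """Plain gradient descent: move against the gradient."""
    return param - lr * grad


def momentum_step(param, grad, velocity, lr=0.1, mu=0.9):
    """SGD with momentum: accumulate a decaying running sum of past updates."""
    velocity = mu * velocity - lr * grad
    return param + velocity, velocity


def rmsprop_step(param, grad, sq_avg, lr=0.01, decay=0.9, eps=1e-8):
    """RMSProp: scale the step by a running average of squared gradients."""
    sq_avg = decay * sq_avg + (1 - decay) * grad * grad
    return param - lr * grad / (math.sqrt(sq_avg) + eps), sq_avg


def adam_step(param, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum plus RMSProp-style scaling, with bias correction (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


def minimize_quadratic_with_sgd(start=5.0, steps=50):
    """Minimize f(x) = x^2, whose gradient is 2x, using plain SGD."""
    x = start
    for _ in range(steps):
        x = sgd_step(x, 2 * x)
    return x
```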

Activation Functions

Activation functions define the output of a node given its input. We describe here the Sigmoid, Tanh, ReLU, Leaky ReLU, ELU, and GELU.
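
For reference, the listed functions can be written for a scalar input using only the standard library. The default `slope` and `alpha` values are common choices rather than anything fixed by the definitions, and GELU is given in its exact form using the Gaussian error function.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return math.tanh(x)


def relu(x):
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    # A small non-zero slope for negative inputs keeps gradients alive.
    return x if x > 0 else slope * x


def elu(x, alpha=1.0):
    # Smoothly saturates to -alpha for large negative inputs.
    return x if x > 0 else alpha * math.expm1(x)


def gelu(x):
    # x times the standard normal CDF evaluated at x.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
```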

Loss Functions

A loss function quantifies the difference between the model’s predictions and the labels. We describe MSE, logistic loss, cross-entropy loss, and contrastive learning.

KNN Search

The ability to conduct efficient K-nearest neighbor (KNN) search is very important. Example applications are top-K web search results, clustering, etc. One technique for KNN is product quantization, which we discuss in this article. But first, some tidbits of relevant information: