Hi! I am Dr. Yee Seng Chan. Here, I write about AI and ML, focusing on NLP and LLMs. Throughout my career, I have been an IC, technical lead, manager, PI, and mentor, often serving in several of these capacities in parallel. I also have a history of internal and external collaborations, with R&D organizations of the DoD and National Intelligence, private organizations (e.g., MITRE), and universities.
Short Bio
After my PhD in NLP with Prof. Ng, I did a postdoc with Prof. Roth at UIUC. Over the course of my PhD and postdoc, I published 12 main-conference papers at premier NLP conferences (6 ACL, 3 EMNLP, 1 AAAI, 1 IJCAI, 1 COLING), 1 journal paper, and a few workshop papers. I am the first author on most of the conference papers.
- My ACL paper with David Chiang showed for the first time that word sense information improves state-of-the-art machine translation performance, helping to settle a decade-long debate in the field.
- My work on word sense disambiguation and MT evaluation metrics also won international benchmark competitions.
I then worked at Raytheon BBN Technologies for 10 years. Raytheon BBN is an R&D company that was awarded the National Medal of Technology and Innovation in 2013. While at Raytheon BBN, I served in multiple parallel capacities on several multi-million-dollar NLP projects:
- Principal investigator in a project funded by DARPA and the Gates Foundation, where we collaborated with external private organizations and universities.
- R&D lead in a challenging multilingual project where we collaborated with external universities.
- Creator, architect, and main developer of NLPLingo, a deep learning NLP R&D system used in almost all of Raytheon BBN’s NLP projects, helping to bring in millions of dollars in revenue. NLPLingo leverages transformers to perform NLP tasks such as information extraction and named entity recognition.
- For two years running, I was also selected to serve as an expert advisor and judge for the Raytheon Innovation Challenge, on the topic of “AI, ML, and Expert Systems for National Security”.
To get out of my comfort zone, I left BBN to join Quattr, a startup focused on applying NLP to Search Engine Optimization (SEO) for corporate clients such as Coursera, Pinterest, and McAfee. At Quattr, I managed the NLP/ML team, where we applied state-of-the-art NLP techniques (transformers, GPT, prompt engineering, etc.) to build solutions for search intent discovery, web page topic discovery, automatic content generation, and internal linking among web pages.
I then moved on to Elemental Cognition (EC). EC was founded by original members of the IBM Watson AI team that won the Jeopardy! game show, and is funded by Bridgewater and other investors. At EC, I leveraged encoders, GPT, prompt engineering, and various transformer models to devise NLP solutions for the biomedical and cybersecurity domains.
Organization of Blog Posts
Start with Transformer Architecture Explained, then explore the posts below.
Transformer Basics
- Subword Tokenization: Byte pair encoding (BPE), WordPiece tokenization, and Unigram tokenization
- Relative Position Embedding
- Rotary Position Embedding
- Perplexity: commonly used as an intrinsic evaluation metric for language models (see the sketch after this list)
- Likelihood-based Generative Models
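To make the perplexity post concrete, here is a minimal sketch (my own toy illustration, with made-up probabilities) showing that perplexity is simply the exponential of the average per-token cross-entropy:

```python
import math

# Toy per-token probabilities that a language model assigned to a short
# sequence (hypothetical numbers, purely for illustration).
token_probs = [0.25, 0.10, 0.60, 0.05]

# Average negative log-likelihood (cross-entropy) per token, in nats.
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is the exponential of the average cross-entropy; lower is better.
perplexity = math.exp(avg_nll)
print(f"average NLL: {avg_nll:.3f} nats, perplexity: {perplexity:.2f}")
```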
Deep Neural Network Basics
- Unicode and UTF-8
- Entropy and Cross-Entropy (see the sketch after this list)
- Loss Functions
- Activation Functions
- Optimizers
- Linear Algebra
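As a quick companion to the entropy and cross-entropy post, here is a small self-contained sketch (toy distributions only) that computes the entropy of a distribution and the cross-entropy between a "true" distribution and a model's prediction:

```python
import math

def entropy(p):
    """H(p) = -sum_i p_i * log(p_i), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # "true" distribution (toy values)
q = [0.5, 0.3, 0.2]  # model's predicted distribution (toy values)

print(f"H(p)    = {entropy(p):.4f}")
print(f"H(p, q) = {cross_entropy(p, q):.4f}")  # >= H(p); the gap is KL(p || q)
```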
Encoder Language Models
Model | Size | Date | Organization | Description |
---|---|---|---|---|
BERT | 110M, 340M | Oct-2018 | Google | Introduced masked language modeling (MLM); see the sketch below the table
RoBERTa | 125M, 355M | Jul-2019 | UWash, Meta | Replication of BERT with more robust training |
XLM | XLM-100 has 570M | Jan-2019 | Meta | Cross-lingual model that uses translation language modeling (TLM) and MLM |
XLM-R | 270M, 550M | Nov-2019 | Meta | Cross-lingual model that uses MLM with more robust training |
ELECTRA | 110M, 335M | Mar-2020 | Stanford, Google | Introduced “replaced token detection” pretraining, that is more sample efficient than MLM |
DeBERTa | 100M | Oct-2021 | Microsoft | Keeps the content embeddings separate from the relative position embeddings |
DeBERTa-v3 | 86M+98M | Mar-2023 | Microsoft | Modified the replaced token detection (RTD) objective of ELECTRA, and combined it with the disentangled attention approach of DeBERTa |
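To illustrate the masked language modeling objective referenced in the BERT row above, here is a minimal sketch. It assumes the Hugging Face transformers library is installed and that the public bert-base-uncased checkpoint can be downloaded; it is my own illustration, not part of any specific post.

```python
# Minimal masked language modeling demo (assumes `pip install transformers`
# and access to the public bert-base-uncased checkpoint).
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the token hidden behind [MASK] from bidirectional context.
for prediction in unmasker("Paris is the [MASK] of France."):
    print(f"{prediction['token_str']:>10s}  score={prediction['score']:.3f}")
```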
Decoder Language Models
Model | Size | Date | Organization | Description |
---|---|---|---|---|
GPT-1 | 117M | Jun-2018 | OpenAI | Demonstrated that pretraining Transformers enables effective transfer learning to downstream NLP tasks
GPT-2 | 1.5B | Feb-2019 | OpenAI | Focused on zero-shot in-context prompting |
GPT-3 | 175B | Jul-2020 | OpenAI | Focused on one-shot and few-shot prompting; see the sketch below the table
MPNet | 110M | Apr-2020 | Nanjing University and Microsoft | Fused and improved on MLM + PLM for pretraining |
FLAN | 137B | Sep-2021 | Google | Shows that multitask fine-tuning of LaMDA-PT improves zero-shot generalization to new tasks
GLaM | 1.2T | Dec-2021 | Google | Decoder-only language model that does conditional computation using a mixture of experts (MoE)
Chinchilla | 70B | Mar-2022 | DeepMind | Shows that number of training tokens should scale equally with model size. Outperforms GPT-3 (175B) |
PaLM | 540B | Apr-2022 | Google | Likely the best decoder-only pretrained model at time of publication
FLAN-PaLM | 540B | Oct-2022 | Google | Multitask instruction fine-tuning on PaLM. Likely the best decoder-only model at time of publication, but probably under-trained
U-PaLM | 540B | Oct-2022 | Google | Continues training PaLM with the UL2 mixture-of-denoisers pretraining objective
LLaMA-1 | 6.7B - 65B | Feb-2023 | Meta | Trained on 1.4T tokens of publicly available texts |
LLaMA-2 | 7B - 70B | Jul-2023 | Meta | Instruction fine-tuned and RLHF |
Mistral 7B | 7B | Oct-2023 | Mistral.ai | Outperforms Llama-2-7B and Llama-2-13B |
Zephyr-7B | 7B | Oct-2023 | Hugging Face | Starts with Mistral-7B, then instruction fine-tuning, then DPO |
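To illustrate the few-shot in-context prompting highlighted in the GPT-3 row above, here is a minimal sketch. It uses the small public GPT-2 checkpoint via the Hugging Face transformers library purely as a stand-in; GPT-3-scale models produce far better few-shot completions.

```python
# Few-shot prompting sketch with a small public decoder-only model
# (assumes `pip install transformers`; gpt2 is only a stand-in for larger models).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Two demonstrations followed by a new input for the model to complete.
prompt = (
    "Translate English to French.\n"
    "English: cheese\nFrench: fromage\n"
    "English: bread\nFrench: pain\n"
    "English: water\nFrench:"
)

output = generator(prompt, max_new_tokens=5, do_sample=False)
print(output[0]["generated_text"])
```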
Encoder-Decoder Language Models
Model | Size | Date | Organization | Description |
---|---|---|---|---|
T5 | 11B | Oct-2019 | Google | First paper to show that a text-to-text Transformer achieves SOTA results. Also shows that span corruption works well; see the sketch below the table
BART | 406M | Oct-2019 | Meta | Similar to T5, but T5 predicts only the masked spans, whereas BART predicts the complete text
mT5 | 13B | Oct-2020 | Google | Multilingual version of the T5 model
Switch | 3.8B | Jan-2021 | Google | Based on T5, but the original dense FFN is replaced with a sparse Switch FFN layer
T0 | 11B | Oct-2021 | Hugging Face and others | Multitask fine-tuning on T5 improves zero-shot performance on unseen tasks. Performs better than GPT-3 (175B) |
UL2 | 20B | May-2022 | Google | Uses a mixture of denoisers for pretraining
Tk-INSTRUCT | 11B | Apr-2022 | University of Washington and others | T5 fine-tuned on 1600+ NLP tasks with written instructions |
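To illustrate the text-to-text framing used by T5 in the table above, here is a minimal sketch assuming the Hugging Face transformers library and the public t5-small checkpoint. Every task is cast as text in, text out, with the task named in the prompt prefix.

```python
# Text-to-text sketch with T5 (assumes `pip install transformers sentencepiece`
# and access to the public t5-small checkpoint).
from transformers import pipeline

t2t = pipeline("text2text-generation", model="t5-small")

# The task is specified by a text prefix; the model emits text for every task.
print(t2t("translate English to German: The house is wonderful.")[0]["generated_text"])
print(t2t("summarize: The quick brown fox jumped over the lazy dog near the river bank.")[0]["generated_text"])
```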
Dialog Models
Model | Size | Date | Organization | Description |
---|---|---|---|---|
TransferTransfo | ~117M | Jan-2019 | Hugging Face | Open-source dialog model that takes on a persona
InstructGPT | 175B | Mar-2022 | OpenAI | Leveraged human preferences, reward modeling, and reinforcement learning to improve GPT-3 models. Predecessor of ChatGPT
LaMDA | 137B | Jan-2022 | Google | Fine-tunes a decoder model for quality, safety, and groundedness
Sentence Transformers
Large Scale Evaluation of Language Models
- SuperGLUE: (May-2019) A Stickier Benchmark for General-Purpose Language Understanding Systems
- BEIR: (Apr-2021) An aggregation of 18 datasets for zero-shot evaluation of 10 IR models
- MTEB: (Oct-2022) Benchmark to evaluate sentence-embedding models
- Holistic Evaluation of Language Models: (Nov-2022) A large-scale evaluation of 30 language models over a set of 16 scenarios and 7 categories of metrics
Improve Efficiency of Language Models
- Quantization (see the sketch after this list)
- PEFT and LoRA
- Time Efficiency
- Others
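As a taste of the quantization post, here is a minimal sketch of symmetric int8 weight quantization (my own toy illustration; production libraries use more sophisticated schemes, e.g., per-channel scales and outlier handling):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    # Per-tensor symmetric quantization: map the largest-magnitude weight to 127.
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max reconstruction error:", np.abs(w - dequantize(q, scale)).max())
```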
Strategies to Improve Language Models
- Techniques that Enable Training Deep Neural Networks: Skip Connections, Layer Normalization, RMSNorm
- Chain-of-Thought (CoT) prompting to elicit reasoning in language models
- Self-Consistency as an inference strategy (see the sketch after this list)
- Big Bird Transformer for Longer Sequences
- Permutation Language Modeling
- Fine-Tuning Language Models for Factuality
- Direct Preference Optimization
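To make the self-consistency item concrete, here is a minimal sketch: sample several chain-of-thought completions for the same question and majority-vote over the final answers. The sample_cot_answer function is a hypothetical stand-in for a temperature-sampled LLM call, simulated here with random choices.

```python
import random
from collections import Counter

def sample_cot_answer(question: str) -> str:
    # Hypothetical stand-in: in practice, generate a reasoning chain with
    # temperature > 0 and parse the final answer out of the completion.
    return random.choice(["18", "18", "18", "22"])  # toy answer distribution

def self_consistency(question: str, num_samples: int = 10) -> str:
    # Majority vote over the sampled final answers.
    answers = [sample_cot_answer(question) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("If I buy 3 cartons of 6 eggs, how many eggs do I have?"))
```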
Search and Retrieval
- Efficient KNN search using product quantization
- Information Retrieval Evaluation Metrics: MAP@K and NDCG (see the sketch after this list)
- ColBERT - Passage Search via Contextualized Late Interaction over BERT
- ColBERTv2 - Efficient Passage Retrieval
- REALM - Augment Language Models with a Knowledge Retriever
- RAG-related
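To accompany the IR evaluation metrics post, here is a minimal sketch (toy relevance labels only) of Average Precision at K, which is averaged over queries to get MAP@K, and NDCG at K for binary relevance. Note that conventions for the AP@K denominator vary; this uses one common choice.

```python
import math

def average_precision_at_k(relevance, k):
    """AP@K for one query; MAP@K is the mean of this over all queries.
    `relevance` is a ranked list of binary labels (1 = relevant)."""
    num_relevant = sum(relevance)
    if num_relevant == 0:
        return 0.0
    hits, score = 0, 0.0
    for i, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            score += hits / i  # precision at each relevant rank
    return score / min(num_relevant, k)

def ndcg_at_k(relevance, k):
    """NDCG@K for binary relevance: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(relevance[:k], start=1))
    ideal = sorted(relevance, reverse=True)
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg > 0 else 0.0

ranked_relevance = [1, 0, 1, 1, 0]  # toy ranked list for one query
print(f"AP@5   = {average_precision_at_k(ranked_relevance, 5):.3f}")
print(f"NDCG@5 = {ndcg_at_k(ranked_relevance, 5):.3f}")
```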
Contact me
chanys.nlp at gmail.com