Hi! I am Dr. Yee Seng Chan. Here, I write about AI and ML, focusing on NLP and LLMs. Throughout my career, I have been an IC, technical lead, manager, PI, and mentor, often serving in several of these capacities in parallel. I also have a history of internal and external collaborations, with R&D organizations of the DoD and National Intelligence, private organizations (e.g., MITRE), and universities.
Short Bio
After my PhD in NLP with Prof. Ng, I did a postdoc with Prof. Roth at UIUC. Over the course of my PhD and postdoc, I published 12 main-conference papers at premier NLP conferences (6 ACL, 3 EMNLP, 1 AAAI, 1 IJCAI, 1 COLING), 1 journal paper, and a few workshop papers. I am the first author on most of the conference papers.
- My ACL paper with David Chiang showed for the first time that word sense information improves state-of-the-art machine translation performance, helping to settle a decade-long debate in the field.
- My work on word sense disambiguation and MT evaluation metrics also won international benchmark competitions.
I then worked at Raytheon BBN Technologies for 10 years. Raytheon BBN is an R&D company that was awarded the National Medal of Technology and Innovation in 2013. While at Raytheon BBN, I served in multiple parallel capacities on several multi-million-dollar NLP projects:
- Principal investigator in a project funded by DARPA and the Gates Foundation, where we collaborated with external private organizations and universities.
- R&D lead in a challenging multilingual project where we collaborated with external universities.
- Creator, architect, and main developer of NLPLingo, a deep learning NLP R&D system used in almost all of Raytheon BBN’s NLP projects, helping to bring in millions of dollars in revenue. NLPLingo leverages transformers to perform NLP tasks such as information extraction and named entity recognition.
- For two years running, I was also selected to serve as an expert advisor and judge for the Raytheon Innovation Challenge, on the topic of “AI, ML, and Expert Systems for National Security”.
To get out of my comfort zone, I left BBN to join Quattr, a startup focused on applying NLP to Search Engine Optimization (SEO) for corporate clients such as Coursera, Pinterest, and McAfee. At Quattr, I managed the NLP/ML team, where we applied state-of-the-art NLP techniques (transformers, GPT, prompt engineering, etc.) to build solutions for search intent discovery, web page topic discovery, automatic content generation, and internal linking among web pages.
I then moved on to Elemental Cognition (EC). EC was founded by original members of the IBM Watson AI team that won the Jeopardy! game show, and is funded by Bridgewater and other investors. At EC, I leveraged encoders, GPT, prompt engineering, and various transformer models to devise NLP solutions for the biomedical and cybersecurity domains.
Organization of Blog Posts
Start with Transformer Architecture Explained, then explore the posts below.
Transformer Basics
- Subword Tokenization: Byte pair encoding (BPE), WordPiece tokenization, and Unigram tokenization
- Relative Position Embedding
- Rotary Position Embedding
- Perplexity: commonly used as an intrinsic evaluation metric for language models (see the sketch after this list)
- Likelihood-based Generative Models
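To make the perplexity post concrete, here is a minimal sketch (my own toy illustration, with made-up probabilities) showing that perplexity is simply the exponential of the average per-token cross-entropy:

```python
import math

# Toy per-token probabilities that a language model assigned to a short
# sequence (hypothetical numbers, purely for illustration).
token_probs = [0.25, 0.10, 0.60, 0.05]

# Average negative log-likelihood (cross-entropy) per token, in nats.
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is the exponential of the average cross-entropy; lower is better.
perplexity = math.exp(avg_nll)
print(f"average NLL: {avg_nll:.3f} nats, perplexity: {perplexity:.2f}")
```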
Deep Neural Network Basics
- Unicode and UTF-8
- Entropy and Cross-Entropy (see the sketch after this list)
- Loss Functions
- Activation Functions
- Optimizers
- Linear Algebra
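As a quick companion to the entropy and cross-entropy post, here is a small self-contained sketch (toy distributions only) that computes the entropy of a distribution and the cross-entropy between a "true" distribution and a model's prediction:

```python
import math

def entropy(p):
    """H(p) = -sum_i p_i * log(p_i), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # "true" distribution (toy values)
q = [0.5, 0.3, 0.2]  # model's predicted distribution (toy values)

print(f"H(p)    = {entropy(p):.4f}")
print(f"H(p, q) = {cross_entropy(p, q):.4f}")  # >= H(p); the gap is KL(p || q)
```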
Encoder Language Models
Model | Size | Date | Organization | Description |
---|---|---|---|---|
BERT | 110M, 340M | Oct-2018 | Google | Introduced masked language modeling (MLM); see the sketch below the table
RoBERTa | 125M, 355M | Jul-2019 | UWash, Meta | Replication of BERT with more robust training |
XLM | XLM-100 has 570M | Jan-2019 | Meta | Cross-lingual model that uses translation language modeling (TLM) and MLM |
XLM-R | 270M, 550M | Nov-2019 | Meta | Cross-lingual model that uses MLM with more robust training |
ELECTRA | 110M, 335M | Mar-2020 | Stanford, Google | Introduced “replaced token detection” pretraining, that is more sample efficient than MLM |
DeBERTa | 100M | Oct-2021 | Microsoft | Keeps the content embeddings separate from the relative position embeddings |
DeBERTa-v3 | 86M+98M | Mar-2023 | Microsoft | Modified the replaced token detection (RTD) objective of ELECTRA, and combined it with the disentangled attention approach of DeBERTa |
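To illustrate the masked language modeling objective referenced in the BERT row above, here is a minimal sketch. It assumes the Hugging Face transformers library is installed and that the public bert-base-uncased checkpoint can be downloaded; it is my own illustration, not part of any specific post.

```python
# Minimal masked language modeling demo (assumes `pip install transformers`
# and access to the public bert-base-uncased checkpoint).
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the token hidden behind [MASK] from bidirectional context.
for prediction in unmasker("Paris is the [MASK] of France."):
    print(f"{prediction['token_str']:>10s}  score={prediction['score']:.3f}")
```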
Decoder Language Models
Model | Size | Date | Organization | Description |
---|---|---|---|---|
GPT-1 | 117M | Jun-2018 | OpenAI | Demonstrated that pretraining Transformers enables effective transfer learning to downstream NLP tasks
GPT-2 | 1.5B | Feb-2019 | OpenAI | Focused on zero-shot in-context prompting |
GPT-3 | 175B | Jul-2020 | OpenAI | Focused on one-shot and few-shot prompting; see the sketch below the table
MPNet | 110M | Apr-2020 | Nanjing University and Microsoft | Fused and improved on MLM + PLM for pretraining |
FLAN | 137B | Sep-2021 | Google | Shows that multitask fine-tuning of LaMDA-PT improves zero-shot generalization to new tasks
GLaM | 1.2T | Dec-2021 | Google | Decoder-only language model that does conditional computation using a mixture of experts (MoE)
Chinchilla | 70B | Mar-2022 | DeepMind | Shows that number of training tokens should scale equally with model size. Outperforms GPT-3 (175B) |
PaLM | 540B | Apr-2022 | Google | Likely the best decoder-only pretrained model at time of publication
FLAN-PaLM | 540B | Oct-2022 | Google | Multitask instruction fine-tuning on PaLM. Likely the best decoder-only model at time of publication, but probably under-trained
U-PaLM | 540B | Oct-2022 | Google | Continues training PaLM with the UL2 mixture-of-denoisers pretraining objective
LLaMA-1 | 6.7B - 65B | Feb-2023 | Meta | Trained on 1.4T tokens of publicly available texts |
LLaMA-2 | 7B - 70B | Jul-2023 | Meta | Instruction fine-tuned and RLHF |
Mistral 7B | 7B | Oct-2023 | Mistral.ai | Outperforms Llama-2-7B and Llama-2-13B |
Zephyr-7B | 7B | Oct-2023 | Hugging Face | Starts with Mistral-7B, then instruction fine-tuning, then DPO |
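To illustrate the few-shot in-context prompting highlighted in the GPT-3 row above, here is a minimal sketch. It uses the small public GPT-2 checkpoint via the Hugging Face transformers library purely as a stand-in; GPT-3-scale models produce far better few-shot completions.

```python
# Few-shot prompting sketch with a small public decoder-only model
# (assumes `pip install transformers`; gpt2 is only a stand-in for larger models).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Two demonstrations followed by a new input for the model to complete.
prompt = (
    "Translate English to French.\n"
    "English: cheese\nFrench: fromage\n"
    "English: bread\nFrench: pain\n"
    "English: water\nFrench:"
)

output = generator(prompt, max_new_tokens=5, do_sample=False)
print(output[0]["generated_text"])
```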
Encoder-Decoder Language Models
Model | Size | Date | Organization | Description |
---|---|---|---|---|
T5 | 11B | Oct-2019 | Google | First paper to show that a text-to-text Transformer achieves SOTA results. Also shows that span corruption works well; see the sketch below the table
BART | 406M | Oct-2019 | Meta | Similar to T5, but T5 predicts only the masked spans, whereas BART predicts the complete text
mT5 | 13B | Oct-2020 | Google | Multilingual version of the T5 model
Switch | 3.8B | Jan-2021 | Google | Based on T5, but the original dense FFN is replaced with a sparse Switch FFN layer
T0 | 11B | Oct-2021 | Hugging Face and others | Multitask fine-tuning on T5 improves zero-shot performance on unseen tasks. Performs better than GPT-3 (175B) |
UL2 | 20B | May-2022 | Google | Uses a mixture of denoisers for pretraining
Tk-INSTRUCT | 11B | Apr-2022 | University of Washington and others | T5 fine-tuned on 1600+ NLP tasks with written instructions |
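To illustrate the text-to-text framing used by T5 in the table above, here is a minimal sketch assuming the Hugging Face transformers library and the public t5-small checkpoint. Every task is cast as text in, text out, with the task named in the prompt prefix.

```python
# Text-to-text sketch with T5 (assumes `pip install transformers sentencepiece`
# and access to the public t5-small checkpoint).
from transformers import pipeline

t2t = pipeline("text2text-generation", model="t5-small")

# The task is specified by a text prefix; the model emits text for every task.
print(t2t("translate English to German: The house is wonderful.")[0]["generated_text"])
print(t2t("summarize: The quick brown fox jumped over the lazy dog near the river bank.")[0]["generated_text"])
```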
Dialog Models
Model | Size | Date | Organization | Description |
---|---|---|---|---|
TransferTransfo | ~117M | Jan-2019 | Hugging Face | Open-source dialog model that takes on a persona
InstructGPT | 175B | Mar-2022 | OpenAI | Leveraged human preferences, reward modeling, and reinforcement learning to improve GPT-3 models. Predecessor of ChatGPT
LaMDA | 137B | Jan-2022 | Google | Fine-tunes a decoder model for quality, safety, and groundedness
Sentence Transformers
Large Scale Evaluation of Language Models
- SuperGLUE: (May-2019) A Stickier Benchmark for General-Purpose Language Understanding Systems
- BEIR: (Apr-2021) An aggregation of 18 datasets for zero-shot evaluation of 10 IR models
- MTEB: (Oct-2022) Benchmark to evaluate sentence-embedding models
- Holistic Evaluation of Language Models: (Nov-2022) A large-scale evaluation of 30 language models over a set of 16 scenarios and 7 categories of metrics
Improve Efficiency of Language Models
- Quantization (see the sketch after this list)
- PEFT and LoRA
- Time Efficiency
- Others
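As a taste of the quantization post, here is a minimal sketch of symmetric int8 weight quantization (my own toy illustration; production libraries use more sophisticated schemes, e.g., per-channel scales and outlier handling):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    # Per-tensor symmetric quantization: map the largest-magnitude weight to 127.
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max reconstruction error:", np.abs(w - dequantize(q, scale)).max())
```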
Strategies to Improve Language Models
- Techniques that Enable Training Deep Neural Networks: Skip Connections, Layer Normalization, RMSNorm
- Chain-of-Thought (CoT) prompting to elicit reasoning in language models
- Self-Consistency as an inference strategy (see the sketch after this list)
- Big Bird Transformer for Longer Sequences
- Permutation Language Modeling
- Fine-Tuning Language Models for Factuality
- Direct Preference Optimization
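To make the self-consistency item concrete, here is a minimal sketch: sample several chain-of-thought completions for the same question and majority-vote over the final answers. The sample_cot_answer function is a hypothetical stand-in for a temperature-sampled LLM call, simulated here with random choices.

```python
import random
from collections import Counter

def sample_cot_answer(question: str) -> str:
    # Hypothetical stand-in: in practice, generate a reasoning chain with
    # temperature > 0 and parse the final answer out of the completion.
    return random.choice(["18", "18", "18", "22"])  # toy answer distribution

def self_consistency(question: str, num_samples: int = 10) -> str:
    # Majority vote over the sampled final answers.
    answers = [sample_cot_answer(question) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("If I buy 3 cartons of 6 eggs, how many eggs do I have?"))
```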
Search and Retrieval
- Efficient KNN search using product quantization
- Information Retrieval Evaluation Metrics: MAP@K and NDCG (see the sketch after this list)
- ColBERT - Passage Search via Contextualized Late Interaction over BERT
- ColBERTv2 - Efficient Passage Retrieval
- REALM - Augment Language Models with a Knowledge Retriever
- RAG-related
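To accompany the IR evaluation metrics post, here is a minimal sketch (toy relevance labels only) of Average Precision at K, which is averaged over queries to get MAP@K, and NDCG at K for binary relevance. Note that conventions for the AP@K denominator vary; this uses one common choice.

```python
import math

def average_precision_at_k(relevance, k):
    """AP@K for one query; MAP@K is the mean of this over all queries.
    `relevance` is a ranked list of binary labels (1 = relevant)."""
    num_relevant = sum(relevance)
    if num_relevant == 0:
        return 0.0
    hits, score = 0, 0.0
    for i, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            score += hits / i  # precision at each relevant rank
    return score / min(num_relevant, k)

def ndcg_at_k(relevance, k):
    """NDCG@K for binary relevance: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(relevance[:k], start=1))
    ideal = sorted(relevance, reverse=True)
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg > 0 else 0.0

ranked_relevance = [1, 0, 1, 1, 0]  # toy ranked list for one query
print(f"AP@5   = {average_precision_at_k(ranked_relevance, 5):.3f}")
print(f"NDCG@5 = {ndcg_at_k(ranked_relevance, 5):.3f}")
```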
Contact me
chanys.nlp at gmail.com