About

Hi! I am Dr. Yee Seng Chan. Here, I write about AI and ML, focusing on NLP and LLMs. Throughout my career, I have been an IC, technical lead, manager, PI, and mentor, sometimes serving in multiple parallel capacities. I also have a history of internal and external collaborations, including with R&D organizations of the DoD and National Intelligence, private organizations (e.g., MITRE), and universities.

Short Bio

After my PhD in NLP with Prof. Ng, I did a postdoc with Prof. Roth at UIUC. Over the course of my PhD and postdoc, I published 12 main-conference papers at premier NLP venues (6 ACL, 3 EMNLP, 1 AAAI, 1 IJCAI, 1 COLING), 1 journal article, and a few workshop papers. I was the first author on most of the conference papers.

  • Our ACL paper with David Chiang showed for the first time that word sense information improves state-of-the-art machine translation performance, helping to settle a decade-long debate in the field.
  • My work on word sense disambiguation and MT evaluation metrics also won international benchmark competitions.

I then worked at Raytheon BBN Technologies for 10 years. Raytheon BBN is an R&D company that was awarded the National Medal of Technology and Innovation in 2013. While at Raytheon BBN, I served in multiple parallel capacities across several multi-million-dollar NLP projects:

  • Principal investigator on a project funded by DARPA and the Gates Foundation, in which we collaborated with external private organizations and universities.
  • R&D lead on a challenging multilingual project in which we collaborated with external universities.
  • Creator, architect, and main developer of NLPLingo, a deep learning NLP R&D system used in almost all of Raytheon BBN’s NLP projects, helping to bring in millions of dollars in revenue. NLPLingo leverages transformers to perform NLP tasks such as information extraction and named entity recognition.
  • For two years running, I was also selected to serve as an expert advisor and judge for the Raytheon Innovation Challenge, on topics regarding “AI, ML, and Expert systems for National Security”.

To get out of my comfort zone, I left BBN to join Quattr, a startup that applies NLP to Search Engine Optimization (SEO) for corporate clients such as Coursera, Pinterest, and McAfee. At Quattr, I managed the NLP/ML team, where we applied state-of-the-art NLP techniques (transformers, GPT, prompt engineering, etc.) to build solutions for search intent discovery, web page topic discovery, automatic content generation, and internal linking among web pages.

I then moved on to Elemental Cognition (EC). EC was founded by original members of the IBM Watson AI team that won Jeopardy!, and is funded by Bridgewater and other investors. At EC, I leveraged encoders, GPT, prompt engineering, and various transformer models to devise NLP solutions for the biomedical and cybersecurity domains.

Organization of Blog Posts

Check out Transformer Architecture Explained first, then go on to the sections below.

Transformer Basics

Deep Neural Network Basics

Encoder Language Models

| Model | Size | Date | Organization | Description |
|---|---|---|---|---|
| BERT | 110M, 340M | Oct-2018 | Google | Introduced masked language modeling (MLM) |
| RoBERTa | 125M, 355M | Jul-2019 | UWash, Meta | Replication of BERT with more robust training |
| XLM | 570M (XLM-100) | Jan-2019 | Meta | Cross-lingual model that uses translation language modeling (TLM) and MLM |
| XLM-R | 270M, 550M | Nov-2019 | Meta | Cross-lingual model that uses MLM with more robust training |
| ELECTRA | 110M, 335M | Mar-2020 | Stanford, Google | Introduced “replaced token detection” pretraining, which is more sample-efficient than MLM |
| DeBERTa | 100M | Oct-2021 | Microsoft | Keeps the content embeddings separate from the relative position embeddings |
| DeBERTa-v3 | 86M + 98M | Mar-2023 | Microsoft | Modified the replaced token detection (RTD) objective of ELECTRA and combined it with the disentangled attention approach of DeBERTa |
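
As a concrete illustration of the masked language modeling objective shared by most of these encoders, here is a minimal sketch using the Hugging Face transformers fill-mask pipeline; the bert-base-uncased checkpoint and the example sentence are illustrative choices, not ones prescribed by the posts above.

```python
# Minimal sketch: probing a BERT-style encoder through its MLM head.
# Assumes the `transformers` library; "bert-base-uncased" is an illustrative checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The encoder predicts a distribution over the vocabulary for the [MASK] position.
for candidate in fill_mask("The capital of France is [MASK]."):
    print(f"{candidate['token_str']:>10}  score={candidate['score']:.3f}")
```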

Decoder Language Models

| Model | Size | Date | Organization | Description |
|---|---|---|---|---|
| GPT-1 | 117M | Jun-2018 | OpenAI | Demonstrated that pretraining Transformers enables effective transfer learning to downstream NLP tasks |
| GPT-2 | 1.5B | Feb-2019 | OpenAI | Focused on zero-shot in-context prompting |
| GPT-3 | 175B | Jul-2020 | OpenAI | Focused on one-shot and few-shot prompting |
| MPNet | 110M | Apr-2020 | Nanjing University, Microsoft | Fused and improved on MLM + PLM for pretraining |
| FLAN | 137B | Sep-2021 | Google | Shows that multitask fine-tuning of LaMDA-PT improves zero-shot generalization to new tasks |
| GLaM | 1.2T | Dec-2021 | Google | Decoder-only language model that does conditional computation using mixture of experts (MoE) |
| Chinchilla | 70B | Mar-2022 | DeepMind | Shows that the number of training tokens should scale equally with model size; outperforms GPT-3 (175B) |
| PaLM | 540B | Apr-2022 | Google | Likely the best decoder-only pretrained model at time of publication |
| FLAN-PaLM | 540B | Oct-2022 | Google | Multitask instruction fine-tuning on PaLM; likely the best decoder-only model at time of publication, but probably under-trained |
| U-PaLM | 540B | Oct-2022 | Google | Continues training PaLM with the UL2 mixture-of-denoisers pretraining objective |
| LLaMA-1 | 6.7B - 65B | Feb-2023 | Meta | Trained on up to 1.4T tokens of publicly available text |
| LLaMA-2 | 7B - 70B | Jul-2023 | Meta | Instruction fine-tuned and aligned with RLHF |
| Mistral 7B | 7B | Oct-2023 | Mistral AI | Outperforms Llama-2-7B and Llama-2-13B |
| Zephyr-7B | 7B | Oct-2023 | Hugging Face | Starts with Mistral-7B, then applies instruction fine-tuning followed by DPO |
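
To make the zero-shot and few-shot prompting ideas behind GPT-2 and GPT-3 concrete, here is a minimal sketch of few-shot in-context prompting using the transformers text-generation pipeline; gpt2 is an illustrative small open checkpoint standing in for the larger models in the table.

```python
# Minimal sketch: few-shot in-context prompting of a decoder-only LM.
# "gpt2" is an illustrative small checkpoint; larger models use the same pattern.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# A few labeled examples in the prompt, followed by a new input for the model to complete.
prompt = (
    "Review: The food was amazing. Sentiment: positive\n"
    "Review: The service was terrible. Sentiment: negative\n"
    "Review: I loved the atmosphere. Sentiment:"
)
completion = generator(prompt, max_new_tokens=3, do_sample=False)
print(completion[0]["generated_text"])
```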

Encoder-Decoder Language Models

| Model | Size | Date | Organization | Description |
|---|---|---|---|---|
| T5 | 11B | Oct-2019 | Google | First paper to show a text-to-text Transformer achieves SOTA results; also shows span corruption works well |
| BART | 406M | Oct-2019 | Meta | Similar to T5, but T5 predicts only the masked spans whereas BART predicts the complete text |
| mT5 | 13B | Oct-2020 | Google | Multilingual version of the T5 model |
| Switch | 3.8B | Jan-2021 | Google | Based on T5, but the original dense FFN is replaced with a sparse Switch FFN layer |
| T0 | 11B | Oct-2021 | Hugging Face and others | Multitask fine-tuning on T5 improves zero-shot performance on unseen tasks; performs better than GPT-3 (175B) |
| UL2 | 20B | May-2022 | Google | Uses a mixture of denoisers for pretraining |
| Tk-INSTRUCT | 11B | Apr-2022 | University of Washington and others | T5 fine-tuned on 1600+ NLP tasks with written instructions |
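
The text-to-text framing that T5 popularized can be illustrated with a short snippet; the t5-small checkpoint below is an illustrative choice, and the task prefixes are the standard ones T5 was trained with.

```python
# Minimal sketch: the text-to-text interface of an encoder-decoder model.
# "t5-small" is an illustrative checkpoint; every task is cast as text in, text out.
from transformers import pipeline

t2t = pipeline("text2text-generation", model="t5-small")

# T5 was trained with task prefixes, so translation is just another text-to-text task.
print(t2t("translate English to German: The house is wonderful.")[0]["generated_text"])

# Summarization uses a different prefix but the exact same interface.
print(t2t("summarize: studies have shown that owning a dog is good for you")[0]["generated_text"])
```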

Dialog Models

| Model | Size | Date | Organization | Description |
|---|---|---|---|---|
| TransferTransfo | ~117M | Jan-2019 | Hugging Face | Open-source dialog model that takes on a persona |
| InstructGPT | 175B | Mar-2022 | OpenAI | Leveraged human preferences, reward modeling, and reinforcement learning to improve GPT-3 models; predecessor of ChatGPT |
| LaMDA | 137B | Jan-2022 | Google | Fine-tunes a decoder model for quality, safety, and groundedness |
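
Dialog and instruction-tuned models expect conversations to be rendered into a model-specific prompt format; the sketch below shows this with the transformers chat-template API, using the openly available Zephyr checkpoint purely as an illustration.

```python
# Minimal sketch: rendering a conversation with a model's chat template.
# "HuggingFaceH4/zephyr-7b-beta" is an illustrative open chat model; each dialog
# model defines its own role markers and special tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain RLHF in one sentence."},
]

# Produces the prompt string the model was fine-tuned to expect.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```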

Sentence Transformers
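
As a taste of what sentence transformers are for, the sketch below embeds a few sentences and compares them with cosine similarity; the all-MiniLM-L6-v2 checkpoint is an illustrative choice from the sentence-transformers model hub.

```python
# Minimal sketch: encoding sentences into fixed-size vectors and comparing them.
# "all-MiniLM-L6-v2" is an illustrative checkpoint from the sentence-transformers hub.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "A man is playing a guitar.",
    "Someone is strumming an instrument.",
    "The stock market fell sharply today.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Semantically similar sentences should score higher than unrelated ones.
print(util.cos_sim(embeddings[0], embeddings[1]).item())
print(util.cos_sim(embeddings[0], embeddings[2]).item())
```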

Large Scale Evaluation of Language Models

  • SuperGLUE: (May-2019) A Stickier Benchmark for General-Purpose Language Understanding Systems
  • BEIR: (Apr-2021) An aggregation of 18 datasets for zero-shot evaluation of 10 IR models
  • MTEB: (Oct-2022) A benchmark for evaluating sentence-embedding models (a usage sketch follows this list)
  • Holistic Evaluation of Language Models: (Nov-2022) A large-scale evaluation of 30 language models over 16 scenarios and 7 categories of metrics
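
For hands-on benchmarking of embedding models, MTEB also ships a Python package; the sketch below assumes its simple task-list interface (which may differ across package versions) and reuses an illustrative sentence-transformers checkpoint.

```python
# Minimal sketch: evaluating a sentence-embedding model on a single MTEB task.
# Assumes the `mteb` package's task-list interface (may vary by version);
# the model and task here are illustrative choices, not recommendations.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
print(results)
```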

Improve Efficiency of Language Models

Strategies to Improve Language Models

Search and Retrieval

Contact me

chanys.nlp at gmail.com