Generative pre-trained transformer
Here are the key points about GPT (Generative Pre-trained Transformer):
- GPT was created by OpenAI in 2018 as a unidirectional language model based on the transformer architecture.
- While BERT is bidirectional, GPT is unidirectional - it reads the text input sequentially from left to right to predict the next word.
- GPT is pretrained on massive text corpora to learn linguistic context and patterns useful for natural language generation tasks.
- The pretraining task is language modeling: predicting the next token given all of the tokens before it, originally over the BookCorpus dataset of unpublished books.
- The original GPT uses 12 stacked transformer decoder blocks with 12 self-attention heads and 768-dimensional embeddings per token; GPT-2 and GPT-3 scale up these dimensions (a minimal code sketch of this configuration and the next-token objective follows this list).
- Fine-tuning GPT on downstream datasets achieves strong performance on tasks such as natural language inference, question answering, semantic similarity, and text classification (a rough fine-tuning sketch also follows this list).
- GPT pioneered the generative pretraining approach for NLP, demonstrating that transformer language models learn representations that transfer across tasks and laying the groundwork for the zero-shot and few-shot abilities of its successors.
- GPT-1, with roughly 117 million parameters, was trained on the BookCorpus dataset of about 7,000 unpublished books to predict the next word in a sequence from the words that precede it.
- GPT-2, with 1.5 billion parameters and trained on the WebText corpus of about 8 million web pages, showed strong textual coherence and competitive zero-shot performance across many language tasks.
- GPT-3, with 175 billion parameters, achieves impressive few-shot learning across many NLP datasets and tasks such as translation and question answering (see the example prompt after this list).
- GPT-4's parameter count has not been publicly disclosed by OpenAI; the widely repeated figure of 100 trillion parameters is an unsubstantiated rumor.
- Unlike BERT's bidirectional training, GPT's unidirectional, auto-regressive approach is well suited to generative text modeling (see the decoding loop after this list).
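To make the pretraining objective concrete, here is a minimal PyTorch sketch of a causal language model at roughly the quoted GPT-1 scale (12 layers, 12 heads, 768-dimensional embeddings). It is an illustration, not OpenAI's implementation: the vocabulary size, context length, and random token batch are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 40_000   # placeholder; GPT-1's BPE vocabulary was roughly this size
N_LAYER, N_HEAD, D_MODEL, CTX = 12, 12, 768, 512   # layer/head/width figures from the bullet above

class TinyGPT(nn.Module):
    """Decoder-only transformer trained to predict the next token (a sketch, not OpenAI's code)."""

    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)   # token embeddings
        self.pos_emb = nn.Embedding(CTX, D_MODEL)          # learned positional embeddings
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=N_HEAD, dim_feedforward=4 * D_MODEL,
            activation="gelu", batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=N_LAYER)
        self.lm_head = nn.Linear(D_MODEL, VOCAB_SIZE, bias=False)

    def forward(self, idx):
        T = idx.size(1)
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        # Causal mask: position t may only attend to positions <= t,
        # which is what makes the model unidirectional (left to right).
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=idx.device), diagonal=1)
        h = self.blocks(x, mask=causal)
        return self.lm_head(h), h   # vocabulary logits and per-token hidden states

model = TinyGPT()
tokens = torch.randint(0, VOCAB_SIZE, (2, 64))   # stand-in for a batch of BPE token ids
logits, _ = model(tokens[:, :-1])                # predict token t+1 from tokens 0..t
loss = F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), tokens[:, 1:].reshape(-1))
print(f"next-token cross-entropy: {loss.item():.3f}")
```

The causal mask is what "unidirectional" means in practice: each position attends only to earlier positions, so every next-token prediction in a sequence can be trained in parallel from a single forward pass.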
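Fine-tuning, in the spirit of the GPT-1 recipe, attaches a small classifier to the pretrained network and trains everything end to end on labeled data. The sketch below continues from TinyGPT above; the 3-class task, learning rate, and random batch are illustrative placeholders, not the exact published setup.

```python
NUM_CLASSES = 3
classifier = nn.Linear(D_MODEL, NUM_CLASSES)
optimizer = torch.optim.AdamW(
    list(model.parameters()) + list(classifier.parameters()), lr=3e-5)

batch_tokens = torch.randint(0, VOCAB_SIZE, (8, 64))   # placeholder labeled batch
batch_labels = torch.randint(0, NUM_CLASSES, (8,))

optimizer.zero_grad()
_, hidden = model(batch_tokens)               # per-token hidden states from the pretrained model
cls_logits = classifier(hidden[:, -1, :])     # classify from the final token's representation
task_loss = F.cross_entropy(cls_logits, batch_labels)
task_loss.backward()
optimizer.step()
```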
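Auto-regressive generation then reduces to a loop that feeds the tokens produced so far back into the model and appends one new token per step. The greedy decoder below continues from the sketch above; real systems typically sample with temperature, top-k, or nucleus sampling rather than taking the argmax.

```python
model.eval()   # disable dropout for decoding

@torch.no_grad()
def generate(model, prompt_tokens, max_new_tokens=20):
    idx = prompt_tokens
    for _ in range(max_new_tokens):
        logits, _ = model(idx[:, -CTX:])               # stay within the context window
        next_tok = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        idx = torch.cat([idx, next_tok], dim=1)        # append and feed the sequence back in
    return idx

continuation = generate(model, torch.randint(0, VOCAB_SIZE, (1, 5)))
print(continuation.shape)   # torch.Size([1, 25])
```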
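To illustrate the few-shot setting popularized by GPT-3, the hypothetical prompt below packs a brief task description and a handful of labeled examples into the model's context; the model is only asked to continue the text after the last "Sentiment:" line, with no parameter updates. The reviews are invented for this example.

```python
# A made-up few-shot prompt: the task is specified entirely in-context.
prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The film was a delight from start to finish.
Sentiment: Positive

Review: I walked out halfway through.
Sentiment: Negative

Review: The soundtrack alone was worth the ticket.
Sentiment:"""
# Whatever the model writes after the final "Sentiment:" is taken as its prediction.
```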
In summary, GPT demonstrated the pretraining power of transformers for language modeling and generative text tasks, pioneering foundational concepts later scaled up by GPT-2, GPT-3, and GPT-4.