Generative pre-trained transformer
Here are the key points about GPT (Generative Pre-trained Transformer):
- GPT was created by OpenAI in 2018 as a unidirectional language model based on the transformer architecture.
- While BERT is bidirectional, GPT is unidirectional - it reads the text input sequentially from left to right to predict the next word.
- GPT is pretrained on massive text corpora to learn linguistic context and patterns useful for natural language generation tasks.
- The pretraining task is language modeling: predicting the next token given all of the tokens before it, originally over the BookCorpus dataset of unpublished books.
- The original GPT uses 12 stacked transformer decoder blocks with 12 self-attention heads and 768-dimensional embeddings per token; GPT-2 and GPT-3 scale up these dimensions (a minimal code sketch of this configuration and the next-token objective follows this list).
- Fine-tuning GPT on downstream datasets achieves strong performance on tasks such as natural language inference, question answering, semantic similarity, and text classification (a rough fine-tuning sketch also follows this list).
- GPT pioneered the generative pretraining approach for NLP, demonstrating that transformer language models learn representations that transfer across tasks and laying the groundwork for the zero-shot and few-shot abilities of its successors.
- GPT-1, with roughly 117 million parameters, was trained on the BookCorpus dataset of about 7,000 unpublished books to predict the next word in a sequence from the words that precede it.
- GPT-2, with 1.5 billion parameters and trained on the WebText corpus of about 8 million web pages, showed strong textual coherence and competitive zero-shot performance across many language tasks.
- GPT-3, with 175 billion parameters, achieves impressive few-shot learning across many NLP datasets and tasks such as translation and question answering (see the example prompt after this list).
- GPT-4's parameter count has not been publicly disclosed by OpenAI; the widely repeated figure of 100 trillion parameters is an unsubstantiated rumor.
- Unlike BERT's bidirectional training, GPT's unidirectional, auto-regressive approach is well suited to generative text modeling (see the decoding loop after this list).
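To make the pretraining objective concrete, here is a minimal PyTorch sketch of a causal language model at roughly the quoted GPT-1 scale (12 layers, 12 heads, 768-dimensional embeddings). It is an illustration, not OpenAI's implementation: the vocabulary size, context length, and random token batch are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 40_000   # placeholder; GPT-1's BPE vocabulary was roughly this size
N_LAYER, N_HEAD, D_MODEL, CTX = 12, 12, 768, 512   # layer/head/width figures from the bullet above

class TinyGPT(nn.Module):
    """Decoder-only transformer trained to predict the next token (a sketch, not OpenAI's code)."""

    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)   # token embeddings
        self.pos_emb = nn.Embedding(CTX, D_MODEL)          # learned positional embeddings
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=N_HEAD, dim_feedforward=4 * D_MODEL,
            activation="gelu", batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=N_LAYER)
        self.lm_head = nn.Linear(D_MODEL, VOCAB_SIZE, bias=False)

    def forward(self, idx):
        T = idx.size(1)
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        # Causal mask: position t may only attend to positions <= t,
        # which is what makes the model unidirectional (left to right).
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=idx.device), diagonal=1)
        h = self.blocks(x, mask=causal)
        return self.lm_head(h), h   # vocabulary logits and per-token hidden states

model = TinyGPT()
tokens = torch.randint(0, VOCAB_SIZE, (2, 64))   # stand-in for a batch of BPE token ids
logits, _ = model(tokens[:, :-1])                # predict token t+1 from tokens 0..t
loss = F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), tokens[:, 1:].reshape(-1))
print(f"next-token cross-entropy: {loss.item():.3f}")
```

The causal mask is what "unidirectional" means in practice: each position attends only to earlier positions, so every next-token prediction in a sequence can be trained in parallel from a single forward pass.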
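Fine-tuning, in the spirit of the GPT-1 recipe, attaches a small classifier to the pretrained network and trains everything end to end on labeled data. The sketch below continues from TinyGPT above; the 3-class task, learning rate, and random batch are illustrative placeholders, not the exact published setup.

```python
NUM_CLASSES = 3
classifier = nn.Linear(D_MODEL, NUM_CLASSES)
optimizer = torch.optim.AdamW(
    list(model.parameters()) + list(classifier.parameters()), lr=3e-5)

batch_tokens = torch.randint(0, VOCAB_SIZE, (8, 64))   # placeholder labeled batch
batch_labels = torch.randint(0, NUM_CLASSES, (8,))

optimizer.zero_grad()
_, hidden = model(batch_tokens)               # per-token hidden states from the pretrained model
cls_logits = classifier(hidden[:, -1, :])     # classify from the final token's representation
task_loss = F.cross_entropy(cls_logits, batch_labels)
task_loss.backward()
optimizer.step()
```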
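Auto-regressive generation then reduces to a loop that feeds the tokens produced so far back into the model and appends one new token per step. The greedy decoder below continues from the sketch above; real systems typically sample with temperature, top-k, or nucleus sampling rather than taking the argmax.

```python
model.eval()   # disable dropout for decoding

@torch.no_grad()
def generate(model, prompt_tokens, max_new_tokens=20):
    idx = prompt_tokens
    for _ in range(max_new_tokens):
        logits, _ = model(idx[:, -CTX:])               # stay within the context window
        next_tok = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        idx = torch.cat([idx, next_tok], dim=1)        # append and feed the sequence back in
    return idx

continuation = generate(model, torch.randint(0, VOCAB_SIZE, (1, 5)))
print(continuation.shape)   # torch.Size([1, 25])
```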
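To illustrate the few-shot setting popularized by GPT-3, the hypothetical prompt below packs a brief task description and a handful of labeled examples into the model's context; the model is only asked to continue the text after the last "Sentiment:" line, with no parameter updates. The reviews are invented for this example.

```python
# A made-up few-shot prompt: the task is specified entirely in-context.
prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The film was a delight from start to finish.
Sentiment: Positive

Review: I walked out halfway through.
Sentiment: Negative

Review: The soundtrack alone was worth the ticket.
Sentiment:"""
# Whatever the model writes after the final "Sentiment:" is taken as its prediction.
```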
In summary, GPT demonstrated the pretraining power of transformers for language modeling and generative text tasks, pioneering foundational concepts later scaled up by GPT-2, GPT-3, and GPT-4.