HOW TO BUILD A GPT MODEL?
What is a GPT model?
GPT stands for Generative Pre-trained Transformer. Earlier language models were designed for single tasks such as text generation, summarization or classification; GPT was the first generalized language model in the history of natural language processing that can be applied to a wide range of NLP tasks. Now let us explore the three components of GPT, namely Generative, Pre-trained and Transformer, and understand what each of them means.
Generative: Generative models are statistical models used to generate new data. These models can learn the relationships between variables in a data set to generate new data points similar to those in the original data set.
Pre-trained: A pre-trained model has already been trained on a large data set, so it can be reused when training a new model from scratch is difficult or expensive. Although a pre-trained model may not be perfect for a given task, it saves time and often improves performance.
Transformer: The transformer, a neural network architecture introduced in 2017, is the best-known deep learning model for handling sequential data such as text. Many tasks, such as machine translation and text classification, are performed with transformer models.
GPT models can perform various NLP tasks with high accuracy thanks to the large datasets they are trained on and an architecture with billions of parameters, which allows them to capture the logical connections within the data. GPT models such as GPT-3 have been pre-trained on text from five large datasets, including Common Crawl and WebText2. This corpus contains nearly a trillion words, allowing GPT-3 to perform NLP tasks quickly and, in many cases, without any examples at all.
Working mechanism of GPT models
GPT is an AI language model based on the transformer architecture that is pre-trained, generative and unsupervised, and that performs well in zero-, one- and few-shot multitask settings. For NLP tasks it has not been explicitly trained on, it predicts the next token (an instance of a sequence of characters) given a sequence of preceding tokens. After seeing only a few examples, it can achieve the desired outcomes on certain benchmarks, including machine translation, Q&A and cloze tasks. GPT models estimate the likelihood of a word appearing next given the text that precedes it, which is essentially a conditional probability. For example, in the sentence, “Margaret is organizing a garage sale…Perhaps we could purchase that old…”, the word ‘chair’ is far more likely to follow than the word ‘elephant’. In addition, transformer models use multiple units called attention blocks that learn which parts of a text sequence to focus on. One transformer can have multiple attention blocks, each learning a different aspect of the language.
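To make the next-token idea concrete, here is a minimal sketch that scores the two candidate words from the example above. GPT-3 itself is not publicly downloadable, so the sketch assumes the Hugging Face transformers library and the freely available GPT-2 model as a stand-in; the prompt and candidate words are purely illustrative.

```python
# Minimal sketch: score candidate next words with a pretrained causal LM (GPT-2 here).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "Margaret is organizing a garage sale. Perhaps we could purchase that old"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(input_ids).logits            # shape: (1, seq_len, vocab_size)

# The distribution over the next token comes from the logits at the last position.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

for word in [" chair", " elephant"]:
    token_id = tokenizer.encode(word)[0]        # first BPE token of the candidate word
    print(f"P({word.strip()!r} | prompt) = {next_token_probs[token_id].item():.6f}")
```

Run as written, this prints a much higher probability for “chair” than for “elephant”, which is exactly the conditional-probability behaviour described above.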
A transformer architecture has two main segments: an encoder that primarily operates on the input sequence and a decoder that operates on the target sequence during training and predicts the next item. For example, a transformer might take a sequence of English words and predict the French words of the correct translation, one at a time, until the sentence is complete.
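The sketch below shows this two-segment layout using PyTorch's built-in nn.Transformer module. This is an assumption of convenience rather than how GPT itself is built (GPT is decoder-only); the vocabulary sizes, dimensions and dummy batch are placeholders.

```python
# A minimal encoder-decoder sketch with PyTorch's built-in nn.Transformer.
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, D_MODEL = 10_000, 10_000, 512

src_embed = nn.Embedding(SRC_VOCAB, D_MODEL)      # English token embeddings
tgt_embed = nn.Embedding(TGT_VOCAB, D_MODEL)      # French token embeddings
transformer = nn.Transformer(d_model=D_MODEL, nhead=8,
                             num_encoder_layers=6, num_decoder_layers=6)
generator = nn.Linear(D_MODEL, TGT_VOCAB)         # projects decoder output to vocabulary logits

# Dummy batch: a 5-token source sentence and a 4-token target prefix, batch size 1.
src = torch.randint(0, SRC_VOCAB, (5, 1))         # (source_len, batch)
tgt = torch.randint(0, TGT_VOCAB, (4, 1))         # (target_len, batch)

# Causal mask so each target position can only attend to earlier positions.
tgt_mask = transformer.generate_square_subsequent_mask(tgt.size(0))

out = transformer(src_embed(src), tgt_embed(tgt), tgt_mask=tgt_mask)
logits = generator(out)                           # (target_len, batch, TGT_VOCAB)
print(logits.shape)                               # next-word scores at each target position
```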
The encoder determines which parts of the input should be emphasized. For example, the encoder can read a sentence like “The quick brown fox jumped.” It then computes the embedding matrix (embeddings in NLP give words with similar meanings a similar representation) and converts it into a series of attention vectors. So, what is an attention vector? Think of an attention vector in a transformer model as a special calculator that helps the model decide which parts of a given piece of information matter most when making a decision. Suppose you are asked multiple questions in an exam and must answer each one using different pieces of information; the attention vector helps you pick the most relevant information for each question. A transformer model computes its attention vectors in much the same way.
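Here is a rough sketch of how such attention vectors are computed via scaled dot-product attention. The projection matrices are random stand-ins for learned weights, and the sequence length and model dimension are illustrative.

```python
# Scaled dot-product attention: each output row is an "attention vector",
# a weighted mix of the value vectors, where the weights say how much each
# input position matters for the current one.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model = 5, 8                       # "The quick brown fox jumped" -> 5 tokens
x = torch.randn(seq_len, d_model)             # stand-in for the embedding matrix

W_q = torch.randn(d_model, d_model)           # learned projections (random here)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / d_model ** 0.5             # how strongly each token attends to every other token
weights = F.softmax(scores, dim=-1)           # each row sums to 1
attention_vectors = weights @ V               # one attention vector per input token

print(weights)                                # which parts of the sentence each token focuses on
print(attention_vectors.shape)                # (5, 8)
```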
The multi-head attention block initially produces these attention vectors. They are then normalized and passed into a fully connected layer, and normalization is applied again before the result is passed to the decoder. During training, the decoder works directly on the target output sequence. Let us say that the target output is the French translation of the English sentence “The quick brown fox jumped.” The decoder computes separate embedding vectors for each French word of the sentence, and positional encoding is applied in the form of sine and cosine functions. Masked attention is also used, which means that when predicting a given French word, only the preceding words of the French sentence are visible while all later words are masked. This is what allows the transformer to learn to predict the next French word. These outputs are then added and normalized before being passed on to another attention block, which also receives the attention vectors generated by the encoder.
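A small sketch of the two decoder-side ingredients just mentioned, sinusoidal positional encodings and the causal mask behind masked attention, is given below; the sequence length and model dimension are arbitrary.

```python
# Sinusoidal positional encoding and a causal ("masked attention") mask.
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    position = torch.arange(seq_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

def causal_mask(seq_len: int) -> torch.Tensor:
    """Upper-triangular mask: position i may not attend to positions after i."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

print(sinusoidal_positional_encoding(seq_len=4, d_model=8))
print(causal_mask(4))   # True marks the positions hidden from the decoder
```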
Alongside this, GPT models apply a form of data compression while consuming millions upon millions of sample texts to convert words into vectors, which are simply numerical representations. The language model then unpacks the compressed text into human-friendly sentences. Compressing and decompressing text in this way improves the model’s accuracy and also lets it calculate the conditional probability of each word. GPT models likewise perform well in “few-shot” settings: because they have been trained on a vast number of text samples, they need only a few examples to produce relevant responses to text they have never seen before.
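In practice, the words-to-vectors step amounts to sub-word tokenization followed by an embedding lookup. The sketch below assumes the GPT-2 byte-pair-encoding tokenizer from the transformers library and an untrained PyTorch embedding table, both chosen purely for illustration.

```python
# Words to vectors: byte-pair-encoding tokenization, then an embedding lookup.
import torch
import torch.nn as nn
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
text = "The quick brown fox jumped."

token_ids = tokenizer.encode(text)
print(tokenizer.convert_ids_to_tokens(token_ids))   # the sub-word pieces
print(token_ids)                                     # their integer IDs

# An (untrained) embedding table mapping each ID to a dense vector.
embedding = nn.Embedding(num_embeddings=tokenizer.vocab_size, embedding_dim=768)
vectors = embedding(torch.tensor(token_ids))
print(vectors.shape)                                 # (num_tokens, 768)
```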
In addition, GPT models have many capabilities, such as generating synthetic text samples of unprecedented quality. If you prime the model with an input, it will generate a long continuation. GPT models outperform other language models trained on domains such as Wikipedia, news and books without using any domain-specific training data. GPT learns language tasks such as reading comprehension, summarization and question answering from text alone, without task-specific training data. The scores on these tasks (“score” refers to a numerical value the model assigns to represent the likelihood of a given output) are not the best, but they suggest that unsupervised techniques, given sufficient data and computation, could benefit these tasks.
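The priming behaviour described above can be reproduced with any autoregressive language model. As a sketch, the example below again uses the public GPT-2 model via the transformers library as a stand-in; the prompt and sampling parameters are arbitrary choices.

```python
# Prime the model with an input and let it generate a continuation.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Margaret is organizing a garage sale because"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

output_ids = model.generate(
    input_ids,
    max_length=60,                       # prompt plus up to ~50 generated tokens
    do_sample=True,                      # sample instead of greedy decoding
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```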
Here is a comparison of GPT models with other well-known language models.
| Feature | GPT | BERT (Bidirectional Encoder Representations from Transformers) | ELMo (Embeddings from Language Models) |
| --- | --- | --- | --- |
| Pretraining approach | Unidirectional language modeling | Bidirectional language modeling (masked language modeling and next-sentence prediction) | Unidirectional language modeling |
| Pretraining data | Large amounts of text from the internet | Large amounts of text from the internet | A combination of internal and external corpora |
| Architecture | Transformer network | Transformer network | Deep bidirectional LSTM network |
| Outputs | Context-aware token-level embeddings | Context-aware token-level and sentence-level embeddings | Context-aware word-level embeddings |
| Fine-tuning approach | Multi-task fine-tuning (e.g., text classification, sequence labeling) | Multi-task fine-tuning (e.g., text classification, question answering) | Fine-tuning on individual tasks |
| Advantages | Can generate text, offers high flexibility in fine-tuning, large model size | Strong performance on a variety of NLP tasks, considers context in both directions | Generates task-specific features, considers context from the entire input sequence |
| Limitations | Can generate biased or inaccurate text; requires large amounts of data | Limited to fine-tuning and requires task-specific architecture modifications; requires large amounts of data | Limited context and task-specific; requires task-specific architecture modifications |