%% Cell type:markdown id: tags:
# Lab 3: Pretraining a GPT model
%% Cell type:markdown id: tags:
This lab is about pretraining large language models. You will work through the full pretraining process for a GPT model, explore different settings, and implement optimisations that make training more efficient. You will also reflect on the impact of data curation on the quality of the pretrained model. By the end of the lab, you will have a solid understanding of how large language models are trained from scratch.
*Tasks you can choose for the oral exam are marked with the graduation cap 🎓 emoji.*
%% Cell type:code id: tags:
``` python
import math
import os
import time
from dataclasses import dataclass
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from gpt2 import Config, Model
```
%% Cell type:markdown id: tags:
## Part 1: Pretraining pipeline
%% Cell type:markdown id: tags:
The GPT pretraining pipeline builds on the basic training loop for neural language models you have seen before, but includes several enhancements that improve stability and efficiency when training large models:
* It uses the [AdamW](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html) optimiser with weight decay instead of vanilla stochastic gradient descent.
* It implements a cosine decay learning rate schedule with a linear warmup phase.
* It accumulates gradient updates across multiple batches to allow training with larger effective batch sizes.
* It uses gradient clipping to prevent exploding gradients.
%% Cell type:markdown id: tags:
### Training configuration
%% Cell type:markdown id: tags:
We begin by setting up a configuration object that defines the key parameters of the training process. The original 124M-parameter GPT-2 model was trained on WebText, a private dataset with 300B tokens of Internet data. Our training pipeline is configured to train a Chinchilla-optimal version of the same model using the [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) dataset and a single A100 GPU with 80GB of memory.
%% Cell type:code id: tags:
``` python
@dataclass
class TrainingConfig:
device: torch.device = torch.device("cuda")
shard_dir: str = "data"
shard_dir: str = "/courses/TDDE09/labs/lab3/data"
# Training steps and data processing
n_steps: int = 4768
n_tokens_per_step: int = 524288
batch_size: int = 64
sequence_len: int = 1024
n_vocab: int = 50304
# Optimisation and learning rate scheduling
weight_decay: float = 0.1
max_lr: float = 6e-4
min_lr: float = 6e-5
n_warmup_steps: int = 715
n_decay_steps: int = 4053
betas: tuple[float, float] = (0.9, 0.95)
clip_norm: float = 1.0
```
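%% Cell type:markdown id: tags:
As a quick sanity check on the claim that this configuration is roughly Chinchilla-optimal, we can multiply the number of steps by the tokens processed per step and compare against the rule of thumb of about 20 training tokens per model parameter (for the 124M-parameter model). This is just back-of-the-envelope arithmetic, not part of the pipeline:
%% Cell type:code id: tags:
``` python
# Back-of-the-envelope check: total training tokens vs. ~20 tokens per parameter
cfg = TrainingConfig()
total_tokens = cfg.n_steps * cfg.n_tokens_per_step
print(f"Total training tokens:    {total_tokens:,}")      # ≈ 2.5B
print(f"Chinchilla rule of thumb: {20 * 124_000_000:,}")  # 20 × 124M parameters
```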
%% Cell type:markdown id: tags:
#### 🎓 Task 3.01: Explaining the training parameters
Your first task is to explain the purpose of the training parameters. Some of them will already be familiar from the lectures, while others will only become clear as you progress through the lab and see how everything fits together. Because of this, it is best to revisit and complete this task towards the end of the lab, when you have a full understanding of the training process.
One parameter to note is the vocabulary size (`n_vocab`). The GPT-2 tokeniser has a default vocabulary size of 50,257, but for training on a GPU, it is helpful to use numbers that are more hardware-friendly. Specifically, numbers with many factors of 2 can lead to more efficient computation. To achieve this, we set the vocabulary size to 50,304, which is slightly larger than needed but has many factors of 2. Note that the extra tokens will not be used in practice — they simply act as placeholders without meaningful embeddings.
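The cell below illustrates the difference; the expression `v & -v` extracts the largest power of two that divides `v`:
%% Cell type:code id: tags:
``` python
# The default vocabulary size 50,257 is odd, while 50,304 = 128 * 393 is
# divisible by 2**7, which aligns better with GPU tile sizes.
for v in (50257, 50304):
    print(f"{v} is divisible by 2^{(v & -v).bit_length() - 1}")
```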
%% Cell type:markdown id: tags:
### Model
%% Cell type:markdown id: tags:
Next, we set up the GPT model. Our goal is to match the original training setup of GPT-2 as closely as possible. To do this, we follow the initialisation strategy outlined in the key research papers about GPT ([Radford et al., 2018](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf); [Radford et al., 2019](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf); [Brown et al., 2020](https://arxiv.org/pdf/2005.14165)) as well as the official implementation ([link](https://github.com/openai/gpt-2)). Here is a summary of this strategy (a generic code sketch follows after the list):
* Token embeddings → Normal distribution with mean $0$ and standard deviation $0.02$.
* Position embeddings → Normal distribution with mean $0$ and standard deviation $0.01$.
* Weights of the linear layers → Normal distribution with mean $0$ and standard deviation $0.02$.
* Biases of the linear layers → Initialised to zeros.
* Weight sharing between the final linear layer and the token embedding.
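As a generic illustration of what such an initialisation can look like — a sketch only, using `isinstance` checks rather than the concrete attribute names in `gpt2.Model`, which may differ — consider the following helper:
%% Cell type:code id: tags:
``` python
# Generic sketch of a GPT-2-style initialisation (not tied to gpt2.Model).
# Position embeddings would use standard deviation 0.01; telling them apart
# from token embeddings requires knowing the attribute names in the actual model.
def init_gpt2_style(module: nn.Module) -> None:
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)

# Typical usage would be model.apply(init_gpt2_style), followed by tying the
# weights of the final linear layer to the token embedding.
```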
%% Cell type:markdown id: tags:
#### 🎈 Task 3.02: Initialising the model
Expand the skeleton code below to create a fresh model and initialise it according to the GPT-2 strategy.
%% Cell type:code id: tags:
``` python
def configure_model(config: TrainingConfig) -> Model:
# TODO: Replace the following line with your own code
return Model(Config(n_vocab=config.n_vocab))
```
%% Cell type:markdown id: tags:
#### Scaled residual initialisation
There is one important detail in the GPT-2 initialisation strategy that we have not addressed yet. [Radford et al. (2019)](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) write (Section 2.3):
> A modified initialization which accounts for the accumulation on the residual path with model depth is used. We scale the weights of residual layers at initialization by a factor of $1/\sqrt{N}$ where $N$ is the number of residual layers.
Why is this necessary? One of the challenges in training large language models is controlling the variance of activations. In particular, there are two points in the GPT architecture where variance can grow uncontrollably: the residual connections after the multi-head attention and the MLP. Since these connections simply add activations from previous layers, their variance increases with depth. To see this, note that if we sum $N$ independent normally distributed variables with variance $\sigma^2$, the result has variance $N \sigma^2$. The factor $1/\sqrt{N}$ compensates for this growth.
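A quick numerical check of this claim, independent of the model itself:
%% Cell type:code id: tags:
``` python
# Summing N standard-normal activations grows the variance roughly linearly
# with N; scaling each summand by 1/sqrt(N) keeps it close to 1.
N = 12
acts = torch.randn(N, 10_000)
print(torch.var(acts.sum(dim=0)).item())                   # ≈ N
print(torch.var((acts / math.sqrt(N)).sum(dim=0)).item())  # ≈ 1
```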
%% Cell type:markdown id: tags:
#### 🎓 Task 3.03: Implementing the scaled residual initialisation
**Step 1.** Suppose each summand in a sum of $N$ independent normally distributed variables with variance $\sigma^2$ is scaled by a factor of $k$. The total variance then becomes $N k^2 \sigma^2$. What happens to the total variance if we choose $k = 1/\sqrt{N}$, as in GPT-2?
**Step 2.** Test the mathematical theory with a simulation. Generate normally distributed activations using `torch.randn()`. Sum the activations across $N$ hypothetical residual layers, first without scaling and then with the $1/\sqrt{N}$ adjustment. Compare the two cases by producing a plot showing the variance at each layer. To compute the variance of a tensor, use `torch.var()`.
**Step 3.** Update your model initialisation from the previous task to include the scaled residual initialisation. The adjustment should only be applied to the linear layer at the end of the multi-head attention and MLP blocks (`c_proj`). Note that a GPT model with $L$ layers has $N = 2 L$ residual layers, because there are two residual connections in each layer (after the multi-head attention and the MLP).
%% Cell type:markdown id: tags:
### Data
%% Cell type:markdown id: tags:
As mentioned before, GPT-2 was pretrained on a non-public dataset of 300B tokens collected by OpenAI. Our pretraining data comes from the public [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) dataset. The full dataset contains 1.3 trillion tokens, but here we will only work with a small sample. We have preprocessed the data by tokenising it with the GPT-2 tokeniser and storing the token indices in equal-sized NumPy arrays, which we call **shards**. The function below loads these shards from a given directory and yields them as PyTorch tensors.
%% Cell type:code id: tags:
``` python
def shards(shard_dir: str):
for shard in sorted(os.listdir(shard_dir)):
yield torch.from_numpy(np.load(os.path.join(shard_dir, shard)).astype(np.int64))
```
%% Cell type:markdown id: tags:
In total, our training data consists of 300M tokens:
%% Cell type:code id: tags:
``` python
sum(s.numel() for s in shards("/courses/TDDE09/labs/lab3/data"))
sum(s.numel() for s in shards(TrainingConfig().shard_dir))
```
%% Cell type:markdown id: tags:
#### 🎓 Task 3.04: Batching the data
Implement a function `make_batches()` that packages the token indices in the shards into pairs of input and output batches suitable for training a language model. One efficient way to do this is sketched in the cell below:
%% Cell type:code id: tags:
``` python
shard = torch.tensor(range(15), dtype=torch.long)
# Suppose we want to package the tokens in this shard into batches of shape (2, 3).
# Step 1: Segment the shard into overlapping chunks of size 2 * 3 + 1
chunk1 = shard[0:7]
chunk2 = shard[6:13]
excess = shard[12:]
# Step 2: Create batches from all but the last and all but the first token in each chunk
x1, y1 = chunk1[:-1].view(2, 3), chunk1[1:].view(2, 3)
x2, y2 = chunk2[:-1].view(2, 3), chunk2[1:].view(2, 3)
```
%% Cell type:markdown id: tags:
The following code cell shows the signature of `make_batches()`.
%% Cell type:code id: tags:
``` python
def make_batches(config: TrainingConfig):
# TODO: Replace the following line with your own code
yield torch.randn(config.batch_size, config.sequence_len)
```
%% Cell type:markdown id: tags:
**Hints and considerations:**
* Batches can stretch across shards. You will have to carry over excess tokens at the end of shards to the next batch.
* All batches should have the same shape. Drop excess tokens at the end of the last shard.
* The correct number of batches of shape $(64, 1024)$ for the training shards is $4577$ (see the sanity check below).
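The sanity check below shows where that number comes from: each batch consumes $64 \times 1024 = 65{,}536$ new tokens, so the count is roughly the total number of tokens divided by that amount.
%% Cell type:code id: tags:
``` python
# Sanity check for the expected number of (64, 1024) batches: each batch
# consumes 64 * 1024 = 65,536 new tokens (plus one token of overlap per chunk).
n_total = sum(s.numel() for s in shards(TrainingConfig().shard_dir))
print(n_total // (64 * 1024))
```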
%% Cell type:markdown id: tags:
### Optimiser
%% Cell type:markdown id: tags:
The code in the cell below configures the [AdamW](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html) optimiser. It uses weight decay on all parameters with two or more dimensions (e.g., weights in linear layers), and no decay on the remaining parameters. If the model is on a CUDA device, the code uses the “fused” implementation of the optimiser for efficiency.
%% Cell type:code id: tags:
``` python
def configure_optimizer(model: Model, config: TrainingConfig):
params = [p for p in model.parameters() if p.requires_grad]
decay_params = [p for p in params if p.dim() >= 2]
no_decay_params = [p for p in params if p.dim() < 2]
param_groups = [
{"params": decay_params, "weight_decay": config.weight_decay},
{"params": no_decay_params, "weight_decay": 0.0},
]
return torch.optim.AdamW(
param_groups,
lr=config.max_lr,
betas=config.betas,
fused=(config.device.type == "cuda"),
)
```
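%% Cell type:markdown id: tags:
As a quick (hypothetical) usage check, we can look at how the parameters split between the two groups. This assumes a CUDA device is available, as in the lab setup:
%% Cell type:code id: tags:
``` python
# Inspect how many parameters end up in the decay and no-decay groups.
cfg = TrainingConfig()
m = configure_model(cfg).to(cfg.device)   # assumes a CUDA device, as in the lab setup
opt = configure_optimizer(m, cfg)
for group in opt.param_groups:
    n = sum(p.numel() for p in group["params"])
    print(f"weight_decay={group['weight_decay']}: {n:,} parameters")
```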
%% Cell type:markdown id: tags:
#### 🎓 Task 3.05: Exploring the beta parameters
The AdamW optimiser is controlled by two hyperparameters $\beta_1$ and $\beta_2$:
* $\beta_1$ controls the moving average of past gradients.
* $\beta_2$ controls the moving average of past squared gradients, which affects the adaptive learning rate scaling.
Lower values correspond to a reduced effect of the respective average, that is, less smoothing and faster adaptation to recent gradients. In this task, you will explore how different beta values affect the convergence behaviour of the optimiser.
The code below applies the optimiser to the function $f(x, y) = (x-2)^2 + (y+3)^2$, which has a global minimum at $(2, -3)$. The code yields the trajectories of the parameter values $(x, y)$ visited by AdamW when started at $(-4, 5)$ for different values of&nbsp;$\beta_1$ and $\beta_2$.
%% Cell type:code id: tags:
``` python
def adam_trajectories(betas):
def loss_function(x, y):
return (x - 2) ** 2 + (y + 3) ** 2
for beta1, beta2 in betas:
x = torch.tensor([-4.0], requires_grad=True)
y = torch.tensor([5.0], requires_grad=True)
optimizer = torch.optim.AdamW((x, y), lr=0.125, betas=(beta1, beta2))
trajectory = []
for _ in range(100):
optimizer.zero_grad()
loss = loss_function(x, y)
loss.backward()
optimizer.step()
trajectory.append((x.item(), y.item()))
yield trajectory, f"β₁ = {beta1}, β₂ = {beta2}"
```
%% Cell type:markdown id: tags:
Your task is to visualise different trajectories in a plot and analyse the results; a minimal plotting sketch follows after the list. Proceed as follows:
* Start by plotting the trajectory for the default values for $\beta_1$ and $\beta_2$. (Consult the PyTorch documentation to find these.)
* Add the plot for the beta values used in our training configuration. For this simple example, do you see a significant difference?
* Add more plots to see what happens if the beta values are too high or too low.
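A minimal plotting sketch, assuming `matplotlib` is available in the lab environment (the beta values passed in are just examples: PyTorch's defaults and the values from our training configuration):
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt

# Plot the optimisation trajectories for a few (beta1, beta2) settings.
for trajectory, label in adam_trajectories([(0.9, 0.999), (0.9, 0.95)]):
    xs, ys = zip(*trajectory)
    plt.plot(xs, ys, marker=".", label=label)
plt.scatter([2], [-3], marker="*", s=200, color="black", label="minimum")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```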
%% Cell type:markdown id: tags:
### Learning rate scheduling
%% Cell type:markdown id: tags:
Your next task is to implement the learning rate scheduler. As mentioned before, GPT-2 is trained with cosine decay from a maximum to a minimum learning rate in conjunction with a linear warmup phase.
%% Cell type:markdown id: tags:
#### 🎓 Task 3.06: Implementing cosine decay
Implement a function `get_lr_factor()` that determines the learning rate for a given `step`. To work with the rest of the implementation, this function should return its result as a factor of the maximum learning rate. In particular, the result at the end of the linear warmup should be&nbsp;$1$. Validate your implementation by plotting the function against the training steps.
%% Cell type:code id: tags:
``` python
def get_lr_factor(step: int, config: TrainingConfig) -> float:
# TODO: Replace the next line with your own code
return 1.0
```
%% Cell type:markdown id: tags:
**Hint:** For $x \in [0, 1]$, the relevant cosine decay is described by the term $\frac{1}{2} \cdot (1 + \cos(\pi x))$.
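One common way to assemble the warmup, the decay, and the floor into a single factor — a sketch only; boundary conventions (for example whether the warmup uses $t$ or $t+1$) differ between implementations — is

$$
\textit{factor}(t) =
\begin{cases}
\dfrac{t + 1}{T_{\text{warmup}}} & \text{if } t < T_{\text{warmup}}, \\[2ex]
\dfrac{\eta_{\min}}{\eta_{\max}} + \left(1 - \dfrac{\eta_{\min}}{\eta_{\max}}\right) \cdot \dfrac{1}{2}\left(1 + \cos(\pi x)\right) & \text{if } T_{\text{warmup}} \le t < T_{\text{warmup}} + T_{\text{decay}}, \\[2ex]
\dfrac{\eta_{\min}}{\eta_{\max}} & \text{otherwise},
\end{cases}
$$

where $x = (t - T_{\text{warmup}}) / T_{\text{decay}}$, $\eta_{\max}$ and $\eta_{\min}$ correspond to `max_lr` and `min_lr`, and $T_{\text{warmup}}$ and $T_{\text{decay}}$ correspond to `n_warmup_steps` and `n_decay_steps`.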
%% Cell type:markdown id: tags:
### Training loop
%% Cell type:markdown id: tags:
At this point, we have everything in place to put together a first version of the training loop. Here it is:
%% Cell type:code id: tags:
``` python
def train(config: TrainingConfig):
model = configure_model(config)
model = model.to(config.device)
batches = make_batches(config)
optimizer = configure_optimizer(model, config)
scheduler = torch.optim.lr_scheduler.LambdaLR(
optimizer,
        lambda step: get_lr_factor(step, config),
)
n_micro_steps = config.n_tokens_per_step // (
config.batch_size * config.sequence_len
)
for step in range(config.n_steps):
model.train()
optimizer.zero_grad()
running_loss = 0.0
for micro_step in range(n_micro_steps):
x, y = next(batches)
x, y = x.to(config.device), y.to(config.device)
logits = model(x)
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
loss.backward()
running_loss += loss.item()
nn.utils.clip_grad_norm_(model.parameters(), config.clip_norm)
optimizer.step()
lr = scheduler.get_last_lr()[0]
scheduler.step()
print(f"step {step:4d} | loss: {running_loss:.4f} | lr: {lr:.4e}")
```
%% Cell type:markdown id: tags:
Most of the steps should look familiar from previous labs or earlier tasks in this notebook. However, a few key aspects will be new:
**Moving the model and data to the training device.** The model and each batch are moved to the training device using `.to()`. In our training configuration, `config.device` is an NVIDIA GPU, which supports fast tensor computations.
**Gradient accumulation.** Instead of taking an optimisation step after every batch, we accumulate gradients over multiple batches (“micro-steps”). Each batch contributes to the gradients using `loss.backward()`.
**Gradient clipping.** Before updating weights, the gradients are clipped using `clip_grad_norm_()`. This prevents excessively large updates that could destabilise training. The clipping threshold is set by `config.clip_norm`.
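To make the effect of gradient clipping concrete, here is a tiny standalone illustration on a toy parameter (not the GPT model):
%% Cell type:code id: tags:
``` python
# clip_grad_norm_() returns the total gradient norm *before* clipping and
# rescales the gradients in place so that their norm does not exceed max_norm.
p = nn.Parameter(torch.zeros(3))
p.grad = torch.tensor([3.0, 4.0, 0.0])                 # gradient norm is 5.0
total_norm = nn.utils.clip_grad_norm_([p], max_norm=1.0)
print(total_norm)   # tensor(5.)
print(p.grad)       # rescaled to norm ~1.0: tensor([0.6000, 0.8000, 0.0000])
```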
%% Cell type:markdown id: tags:
#### 🎓 Task 3.07: Fixing the training loop
Try to train a model by executing the code below:
%% Cell type:code id: tags:
``` python
train(TrainingConfig())
```
%% Cell type:markdown id: tags:
You will run into two problems:
**Memory issues.** As mentioned above, the training has been set up for an A100 GPU with 80GB of memory. If you are using a less powerful GPU, you will see an out-of-memory error. To fix this, reduce the batch size to lower the memory load on the GPU. Start by halving the batch size and keep adjusting it until you find the largest batch size that fits on your GPU. (You may have to restart the Jupyter kernel to reset the GPU.)
**High losses.** The training losses start at very high values. Recall that the model’s goal is to predict the next token. At the start, the model’s weights are random, so we expect it to output a uniform distribution over the vocabulary. This means each token should have a probability of $1/V$, where $V$ is the vocabulary size. Given this, what should the initial loss be? Think about the cross-entropy loss for a uniform distribution.
(You will fix the problem with the high losses in the next task.)
%% Cell type:markdown id: tags:
#### 🎓 Task 3.08: Fixing the gradient accumulation
The gradient accumulation in the `train()` function is not implemented correctly. To see this, consider the following example. We set up a linear layer and pass in some random input of shape $[2, 3]$. First, we compute the loss and gradients in the normal way. Then, we do the same thing using gradient accumulation over two singleton batches, in the way this is implemented in `train()`. As you will see, the two outputs are different.
%% Cell type:code id: tags:
``` python
def accumulation_example():
# Set up a simple model and simulate some input
model = nn.Linear(3, 2)
x = torch.randn(2, 3)
y = torch.randint(0, 2, (2,))
# Compute the gradient on the complete input
model.zero_grad()
output = model(x)
loss = F.cross_entropy(output, y)
loss.backward()
print(model.weight.grad)
# Compute the gradient using micro-batches (flawed)
model.zero_grad()
for i in range(2):
output = model(x[i : i + 1])
loss = F.cross_entropy(output, y[i : i + 1])
loss.backward()
print(model.weight.grad)
accumulation_example()
```
%% Cell type:markdown id: tags:
Your task is to fix the flawed implementation of gradient accumulation in the training loop.
1. Propose a fix for the problem illustrated in the example.
2. Validate your proposal by modifying the example.
3. Once you are convinced that your fix is correct, apply it to the training loop.
4. How does the fix affect the loss?
%% Cell type:markdown id: tags:
## Part 2: Efficiency optimisations
%% Cell type:markdown id: tags:
Training large language models from scratch is a computationally intensive task. Efficient training not only speeds up model convergence but also reduces hardware costs and energy consumption. As models grow in size, optimising the training process becomes increasingly important to ensure that resources are used effectively. In the second part of this lab, we will look into a few such optimisation techniques.
%% Cell type:markdown id: tags:
#### 🎓 Task 3.09: Profiling the training loop
Before we explore optimisations, we first need a way to measure training speed. We will define training speed as the **number of tokens processed per second**. Your task is to implement this measurement in the training loop.
**Step&nbsp;1.** Keep track of the number of tokens processed in each optimisation step.
**Step&nbsp;2.** Add a timer to measure the time taken per step. For accurate results, call `torch.cuda.synchronize()` before stopping the timer. This ensures all GPU computations finish before recording the time.
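A minimal sketch of this timing pattern (the placeholder comment stands in for the actual forward/backward work of one step):
%% Cell type:code id: tags:
``` python
cfg = TrainingConfig()
t0 = time.time()
# ... forward passes, loss.backward(), and optimizer.step() would go here ...
if torch.cuda.is_available():
    torch.cuda.synchronize()   # wait for all queued GPU work to finish
dt = max(time.time() - t0, 1e-9)
print(f"{cfg.n_tokens_per_step / dt:,.0f} tokens/s")
```
%% Cell type:markdown id: tags: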
Once you have added this measurement, train the model for a few steps and answer the following questions:
* Based on your measured training speed, how long would it take to train on all data (300M tokens)?
* How much data and time would be required to train a Chinchilla-optimal version of the model?
%% Cell type:markdown id: tags:
### Floating-point representations
%% Cell type:markdown id: tags:
One of the optimisations we can make when training large language models is choosing an appropriate floating-point representation. A floating-point number is represented using three components: a **sign bit**, an **exponent**, and a **fraction** (also called mantissa). The number of bits assigned to the exponent determines the *range* of the representation, while the number of bits in the fraction determines the *precision*.
The most commonly used floating-point format is single-precision floating point (fp32), which uses 1&nbsp;sign bit, 8&nbsp;exponent bits, and 23&nbsp;fraction bits. However, most deep learning computations do not require the full 23-bit precision of fp32. As a result, modern hardware supports lower-precision formats that improve performance and memory efficiency while maintaining training stability. We are particularly interested in two of those:
**TensorFloat-32 (tf32)** is a precision format introduced by NVIDIA. It keeps the same 8-bit exponent as fp32, preserving the same numerical range. However, tf32 reduces the fraction size to 10 bits, trading precision for significantly faster matrix multiplications on GPUs with hardware support for the format.
**Brain Floating Point 16 (bf16)** is another reduced-precision format, widely supported on modern NVIDIA GPUs. Like tf32, bf16 has the same 8-bit exponent as fp32 and therefore the same range. However, bf16 has only 7 bits for the fraction. Unlike tf32, which is primarily used for internal computations, bf16 can be directly stored and used for activations, weights, and gradients, which saves memory (16 bits per value instead of 32).
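To get a feel for the trade-off, you can inspect the formats that PyTorch exposes as tensor dtypes (tf32 is used internally in matrix multiplications and has no standalone dtype to inspect):
%% Cell type:code id: tags:
``` python
# Compare machine epsilon (precision) and the largest representable value (range).
for dtype in (torch.float32, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):16s} eps = {info.eps:.2e}   max = {info.max:.2e}")
```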
%% Cell type:markdown id: tags:
#### 🎈 Task 3.10: Exploring floating-point representations
Your task is to rewrite the training loop to take advantage of specialised floating-point representations.
**Step&nbsp;1.** By default, PyTorch uses the fp32 format for internal computations (“highest precision”). Read the documentation of [`torch.set_float32_matmul_precision()`](https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html) to find out how to change this default and use tf32 if possible.
**Step&nbsp;2.** Computing the forward pass and the loss requires even less precision than other parts of the training loop. Read the documentation of [`torch.autocast()`](https://pytorch.org/docs/stable/amp.html#torch.autocast) to find out how to execute these operations using the bf16 format.
**Step&nbsp;3.** Repeat your profiling experiments with the modified training loop. How much time would be required to train the model now that you have implemented the floating-point optimisations?
%% Cell type:markdown id: tags:
### Just-in-time compilation
%% Cell type:markdown id: tags:
The second optimisation we will explore is **just-in-time (JIT) compilation** – a technique to speed up code by compiling it at runtime, rather than before execution. This allows the compiler to optimise the code dynamically based on actual inputs and hardware conditions.
PyTorch provides a function [`torch.compile()`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) that uses JIT compilation to automatically optimise deep learning models. Instead of executing PyTorch operations one by one, this function analyses the computation, restructures it, and generates highly efficient machine code. This optimised code can often be run as a single fused operation on the GPU.
The performance boost from JIT compilation is especially noticeable on high-end GPUs like the NVIDIA A100 or H100, but it can still provide smaller speedups on less powerful GPUs.
%% Cell type:markdown id: tags:
#### 🎈 Task 3.11: Exploring just-in-time compilation
In this task, you will measure how much `torch.compile()` improves the training speed of the GPT-2 model. To enable JIT compilation, add the following line to the training loop:
```
model = torch.compile(model)
```
Make sure to place this *after* moving the model to the training device so the optimisation can be tailored to the hardware you are using. Once you have enabled JIT compilation, rerun your profiling experiments and compare the training speed before and after.
%% Cell type:markdown id: tags:
## Part 3: Evaluating the pretrained model
%% Cell type:markdown id: tags:
This notebook contains all the code needed to train a GPT-2 model on the FineWeb-Edu dataset. However, training on the full dataset is not practical within the time and resource limits of this lab.
To save time, we provide a pretrained model that was trained using the exact same settings you developed earlier. The only difference is that we used a compute node with 8 NVIDIA A100 GPUs (80GB each). With this setup, a full Chinchilla-optimal training run takes about 30 minutes.
We provide the trained model in the file `gpt-2-fineweb-edu.pt`. You can load it with the following code:
%% Cell type:code id: tags:
``` python
pretrained = Model(Config(n_vocab=50304))
pretrained.load_state_dict(torch.load("gpt-2-fineweb-edu.pt"))
pretrained.lm_head.weight = pretrained.wte.weight
```
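%% Cell type:markdown id: tags:
Before running the benchmark, a quick smoke test can be reassuring. The sketch below greedily extends a prompt; it assumes, as the evaluation cell further down does, that the forward pass returns logits of shape `[batch, sequence, vocab]`, and the prompt text is just an example:
%% Cell type:code id: tags:
``` python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = torch.tensor([enc.encode("The purpose of education is")], dtype=torch.long)

pretrained.eval()
with torch.no_grad():
    for _ in range(20):
        logits = pretrained(ids)              # [1, seq_len, vocab]
        next_id = logits[0, -1].argmax()      # greedy choice
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(enc.decode(ids[0].tolist()))
```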
%% Cell type:markdown id: tags:
#### 🎓 Task 3.12: Evaluating the pretrained model
The next cell contains code for evaluating the FineWeb-Edu model on the same small sample from the [HellaSwag](https://rowanzellers.com/hellaswag/) benchmark you already used in lab&nbsp;2. How does its score compare to that of the original GPT-2 model? What does this result tell you about the impact of data quality on the downstream performance of language models?
%% Cell type:code id: tags:
``` python
import json
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
with open("hellaswag-mini.jsonl") as f:
n_correct = 0
n_total = 0
for line in f:
sample = json.loads(line)
prefix = tokenizer.encode(sample["ctx"])
ending_scores = []
for i, ending in enumerate(sample["endings"]):
suffix = tokenizer.encode(" " + ending)
context = torch.tensor([prefix + suffix], dtype=torch.long)
with torch.no_grad():
logits = pretrained(context)
ending_score = torch.nn.functional.cross_entropy(
logits[0, -len(suffix) - 1 : -1], context[0, -len(suffix) :]
)
ending_scores.append((ending_score, i))
predicted = min(ending_scores)[1]
n_correct += int(predicted == sample["label"])
n_total += 1
print(f"Accuracy: {n_correct / n_total:.2%}")
```
%% Cell type:markdown id: tags:
**🥳 Congratulations on finishing lab&nbsp;3!**