"The statistical bigram model is not very good at generating names. One way to improve it is to condition its probabilities on larger contexts beyond the immediately preceding character. However, this exponentially inflates the model’s parameter count. For instance, with a 51-character vocabulary, a statistical bigram model encompasses 2,601 probabilities, while a 3-gram model jumps to 132,651, and a 4-gram model has 6,765,201 probabilities.\n",
"\n",
"In this notebook we will build an alternative bigram model grounded in neural networks. Instead of relying on direct probability estimates, we will train a softmax classifier that predicts the succeeding character based on the preceding one. While this model does not outperform the statistical bigram model, it scales better to longer contexts and larger vocabularies. Additionally, the neural architecture offers greater flexibility."
]
},
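{
"cell_type": "markdown",
"id": "1f2a3b4c",
"metadata": {},
"source": [
"As a quick sanity check on these numbers, the next cell computes the parameter counts directly, taking the vocabulary size of 51 from the example above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2a3b4c5d",
"metadata": {},
"outputs": [],
"source": [
"# An n-gram model over a vocabulary of size V has V ** n probabilities\n",
"V = 51\n",
"for n in [2, 3, 4]:\n",
"    print(f'{n}-gram: {V ** n:,} probabilities')"
]
},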
{
"cell_type": "markdown",
"id": "3943d2ea",
"metadata": {},
"source": [
"## Dataset"
]
},
{
"cell_type": "markdown",
"id": "e89dd769",
"metadata": {},
"source": [
"We re-use the names dataset and our code for building the character index."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7c6290e1",
"metadata": {},
"outputs": [],
"source": [
"# Load the list of names\n",
"with open('names.txt', encoding='utf-8') as fp: \n",
" names = [line.rstrip() for line in fp]\n",
"\n",
"# Build the character index\n",
"char2idx = {'$': 0}\n",
"for name in names:\n",
" for char in name:\n",
" if char not in char2idx:\n",
" char2idx[char] = len(char2idx)"
]
},
{
"cell_type": "markdown",
"id": "8f792f99",
"metadata": {},
"source": [
"This time, we store all bigrams in a list:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "41030442",
"metadata": {},
"outputs": [],
"source": [
"all_bigrams = []\n",
"for name in names:\n",
" for bigram in zip('$' + name, name + '$'):\n",
" all_bigrams.append(bigram)"
]
},
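{
"cell_type": "markdown",
"id": "3b4c5d6e",
"metadata": {},
"source": [
"To see how the `zip` trick pads a name with the marker `$` on both sides, here is what it yields for a short made-up name:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c5d6e7f",
"metadata": {},
"outputs": [],
"source": [
"# Bigrams extracted from a made-up example name\n",
"list(zip('$' + 'anna', 'anna' + '$'))"
]
},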
{
"cell_type": "markdown",
"id": "678e1c8b",
"metadata": {},
"source": [
"The total number of bigrams in our data is 228,887:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "573a279e",
"metadata": {},
"outputs": [],
"source": [
"len(all_bigrams)"
]
},
{
"cell_type": "markdown",
"id": "d319a3bf",
"metadata": {},
"source": [
"> **🤔 Problem 1: Total number of bigrams**\n",
">\n",
"> Explain how to obtain the total number of bigrams in the data from the matrix we computed in the notebook on the statistical bigram model."
]
},
{
"cell_type": "markdown",
"id": "828907a0",
"metadata": {},
"source": [
"## One-hot representation"
]
},
{
"cell_type": "markdown",
"id": "c259be8e",
"metadata": {},
"source": [
"Neural networks require us to represent our data as vectors, matrices, and tensors. Here, we opt to represent preceding characters using **one-hot vectors**. The one-hot vector for a character $c_i$ has as many components as there are elements in our vocabulary, and is zero everywhere except in component $i$, where it takes the value one.\n",
"\n",
"The next cell defines a function `vectorize` that takes a list of bigrams and returns a matrix $X$ and a vector $y$. The matrix $X$ has one row for each bigram, and that row is the one-hot vector for the first character of the corresponding bigram. The vector $y$ contains the integer indexes of the corresponding second characters."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a9521c11",
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"import torch.nn.functional as F\n",
"\n",
"def vectorize(bigrams):\n",
" # Split the batch into inputs (previous characters) and outputs (next characters)\n",
" xs, ys = zip(*bigrams)\n",
" \n",
" # Replace each character by its integer index\n",
"The code in the next cell computes the vectorised representation of the first three bigrams:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "695e1e55",
"metadata": {},
"outputs": [],
"source": [
"vectorize(all_bigrams[:3])"
]
},
{
"cell_type": "markdown",
"id": "3ae4c625",
"metadata": {},
"source": [
"> **🤔 Problem 2: One-hot representation**\n",
">\n",
"> Suppose we compute the vector representation for all bigrams in our dataset; this returns an $m$-by-$n$ matrix $X$ and an $m$-dimensional vector $y$. What are the values of $m$ and $n$?"
]
},
{
"cell_type": "markdown",
"id": "570e92bb",
"metadata": {},
"source": [
"## Training the model"
]
},
{
"cell_type": "markdown",
"id": "94ef53e3",
"metadata": {},
"source": [
"We are now ready to set up a softmax regression model and train it using cross-entropy loss.\n",
"\n",
"To refresh, a softmax regression model consists of a linear layer followed by the softmax function. In PyTorch, we can employ the [`nn.Linear`](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) class for linear layers and use the [`nn.functional.cross_entropy()`](https://pytorch.org/docs/stable/nn.functional.html#cross-entropy) function to compute the cross-entropy loss. It is crucial to note that this function requires pre-softmax logits as input, not probabilities. As a consequence, we do not actually need the softmax function for training the model.\n",
"\n",
"We train our model using gradient descent using the [`utils.data.DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) abstraction. Iterating over a data loader yields small batches of the underlying dataset – in our case, batches of bigrams. We configure the data loader to send each such batch through the one-hot vectoriser we defined above using the `collate_fn` keyword."
"By looking at the loss curve, we can confirm that the network is actually learning to predict the next character from the previous one."
]
},
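{
"cell_type": "markdown",
"id": "5d6e7f8a",
"metadata": {},
"source": [
"One way to inspect this loss curve is to plot the per-batch losses recorded in the `losses` list during training; the sketch below assumes that `matplotlib` is available."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6e7f8a9b",
"metadata": {},
"outputs": [],
"source": [
"# Plot the recorded per-batch losses (assumes matplotlib is installed)\n",
"import matplotlib.pyplot as plt\n",
"\n",
"plt.plot(losses)\n",
"plt.xlabel('Batch')\n",
"plt.ylabel('Cross-entropy loss')\n",
"plt.show()"
]
},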
{
"cell_type": "markdown",
"id": "a72af016",
"metadata": {},
"source": [
"## Evaluation"
]
},
{
"cell_type": "markdown",
"id": "25d8844b",
"metadata": {},
"source": [
"The code in the next cell computes the perplexity of our neural model. The code is considerably simpler than in the notebook on the statistical bigram model because the average negative log likelihood we need to compute as a first step is exactly the cross-entropy loss."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "509e8faa",
"metadata": {},
"outputs": [],
"source": [
"# Do the following without gradient calculation\n",
"with torch.no_grad():\n",
"\n",
" # Get the vectorised version of all bigrams\n",
" test_x, test_y = vectorize(all_bigrams)\n",
"\n",
" # Compute the cross-entropy loss (= average negative log likelihood)\n",
" loss = F.cross_entropy(model.forward(test_x), test_y)\n",
"\n",
" # Print as perplexity\n",
" print(f'{torch.exp(loss):.1f}')"
]
},
{
"cell_type": "markdown",
"id": "e75bd15b",
"metadata": {},
"source": [
"As we can see, the perplexity of the neural bigram model is only slightly higher than that of its statistical counterpart."
]
},
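{
"cell_type": "markdown",
"id": "7f8a9b0c",
"metadata": {},
"source": [
"To put these numbers in perspective, a model that assigns the uniform probability $1/V$ to every character, where $V$ is the size of the vocabulary, has perplexity exactly $V$. Both bigram models land well below this baseline:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8a9b0c1d",
"metadata": {},
"outputs": [],
"source": [
"# Perplexity of a uniform model: the exponential of its average negative log likelihood, log(V)\n",
"import math\n",
"\n",
"uniform_nll = math.log(len(char2idx))\n",
"print(f'{math.exp(uniform_nll):.1f}')"
]
},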
{
"cell_type": "markdown",
"id": "75b9721d",
"metadata": {},
"source": [
"> **🤔 Problem 3: Improving the perplexity**\n",
">\n",
"> Try to improve the perplexity of the model by changing the training hyperparameters: number of epochs, batch size, learning rate. Building on your previous machine learning knowledge, do you have other ideas for how you could improve perplexity?"
]
},
{
"cell_type": "markdown",
"id": "4d539eb6",
"metadata": {},
"source": [
"## Sampling from the model"
]
},
{
"cell_type": "markdown",
"id": "b601f8c2",
"metadata": {},
"source": [
"Finally, we present code that generates text by repeatedly sampling from the neural model. The general structure of this code is very similar to that of the corresponding code for the statistical bigram model; the main differences are in the input to and output from the model. Note that in this code we actually use the softmax function (implemented in [`nn.functional.softmax()`](https://pytorch.org/docs/stable/nn.functional.html#cross-entropy)) to convert the logits into proper probabilities."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d6471c00",
"metadata": {},
"outputs": [],
"source": [
"# Construct the index-to-character mapping\n",
"idx2char = {i: c for c, i in char2idx.items()}\n",
"\n",
"# Do the following without gradient calculation\n",
"with torch.no_grad():\n",
"\n",
" # Generate 5 samples\n",
" for _ in range(5):\n",
"\n",
" # We begin with the start-of-sequence marker\n",
" generated = '$'\n",
"\n",
" while True:\n",
" # Look up the integer index of the previous character\n",
" previous_char = [char2idx[generated[-1]]]\n",
"\n",
" # Construct the corresponding one-hot vector\n",
" x = F.one_hot(torch.LongTensor(previous_char), len(char2idx)).float()\n",
"\n",
" # Get the logits (negative log probabilities)\n",
" logits = model.forward(x)\n",
"\n",
" # Turn the logits into a probability distribution\n",
" p = F.softmax(logits, dim=-1)\n",
"\n",
" # Sample from the distribution\n",
" y = torch.multinomial(p, num_samples=1).item()\n",
"\n",
" # Get the corresponding character\n",
" next_char = idx2char[y]\n",
"\n",
" # Break if the model generates the end-of-sequence marker\n",
" if next_char == '$':\n",
" break\n",
"\n",
" # Add the next character to the output\n",
" generated = generated + next_char\n",
"\n",
" # Print the generated output (without the start-of-sequence marker)\n",
" print(generated[1:])"
]
},
{
"cell_type": "markdown",
"id": "b646cc6d",
"metadata": {},
"source": [
"> **🤔 Problem 4: Sampling with temperature**\n",
">\n",
"> We can refine sampling by introducing a hyperparameter called **temperature**. This setting controls the randomness of the model output and prevents the model from replicating patterns in the training data too closely. Practically, we divide the logits by a value $T > 0$ before sending them through the softmax function. What is the effect of that? How do the generated names change when you introduce temperature into the sampling procedure?"
]
},
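{
"cell_type": "markdown",
"id": "9b0c1d2e",
"metadata": {},
"source": [
"To make the mechanics concrete, the sketch below performs a single temperature-scaled sampling step for the character following the start-of-sequence marker; the value of $T$ is an arbitrary example."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0c1d2e3f",
"metadata": {},
"outputs": [],
"source": [
"# One sampling step with temperature; T is an arbitrary example value\n",
"T = 0.8\n",
"\n",
"with torch.no_grad():\n",
"    x = F.one_hot(torch.LongTensor([char2idx['$']]), len(char2idx)).float()\n",
"    logits = model.forward(x)\n",
"    p = F.softmax(logits / T, dim=-1)  # divide the logits by T before the softmax\n",
"    print(idx2char[torch.multinomial(p, num_samples=1).item()])"
]
},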
{
"cell_type": "markdown",
"id": "3fc23d92",
"metadata": {},
"source": [
"That’s all folks!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}