Commit 1da138c6 authored by Marco Kuhlmann's avatar Marco Kuhlmann
Update the notebook on the statistical bigram model

%% Cell type:markdown id:3b684dd6 tags:
# Statistical bigram model
%% Cell type:markdown id:f5a8f0f1 tags:
In this notebook, we will build our first language model. Large language models such as GPT-4 generate text as a sequence of words. The language model presented here is very small and generates text as a sequence of individual characters. Also, it is not built on a neural network but is based on probabilities defined on pairs of adjacent characters. Because of this, it is known as a **statistical bigram model**.
%% Cell type:markdown id:752b28ad tags:
## Dataset
%% Cell type:markdown id:fa32d3b6 tags:
The task we will address in this notebook is to generate person names. The data for this task comes as a text file containing Swedish first names (*tilltalsnamn*). More specifically, the file lists the most frequent names as of 2022-12-31 in decreasing order of frequency. The raw data was obtained from [Statistics Sweden](https://www.scb.se/) and postprocessed by lowercasing each name.
We start by opening the file and storing its contents in a Python list:
%% Cell type:code id:64664b0b tags:
``` python
with open("names.txt", encoding="utf-8") as f:
    names = [line.rstrip() for line in f]
```
%% Cell type:markdown id:c878a612 tags:
Here are the five most frequent names:
%% Cell type:code id:eb5b1491 tags:
``` python
names[:5]
```
%% Cell type:markdown id:1889ff8e tags:
In total, we have 32K names:
%% Cell type:code id:255dd353 tags:
``` python
len(names)
```
%% Cell type:markdown id:134a9c3b tags:
#### 🎈 Task 1: What’s in the data?
It is important to engage with the data you are working with. One way to do so is to ask questions. Is your own name included in the dataset? Is your name frequent or rare? Can you tell from a name whether the person with that name is male or female? Can you tell whether they are immigrants? What would be ethical and unethical uses of systems that *can* tell this?
%% Cell type:markdown id:5a16af3f tags:
## Character-to-index mapping
%% Cell type:markdown id:1377cd22 tags:
We create a string-to-integer mapping from the characters (letters) in the names:
%% Cell type:code id:60e6cc4b tags:
``` python
char2idx = {"$": 0}
for name in names:
    for char in name:
        if char not in char2idx:
            char2idx[char] = len(char2idx)
```
%% Cell type:markdown id:3cd9b820 tags:
Note that we reserve a special character `$` with index 0. We will use this character to mark the start and the end of a sequence of characters.
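%% Cell type:markdown tags:
To see the mapping in action, here is what the loop above produces on a toy list of names (a self-contained illustration; the toy names are not part of the dataset):
%% Cell type:code tags:
``` python
char2idx_toy = {"$": 0}
for name in ["anna", "bo"]:
    for char in name:
        # Only assign a new index the first time we see a character
        if char not in char2idx_toy:
            char2idx_toy[char] = len(char2idx_toy)
print(char2idx_toy)  # {'$': 0, 'a': 1, 'n': 2, 'b': 3, 'o': 4}
```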
%% Cell type:markdown id:b7d4cf23 tags:
> **🤯 What does this code do?**
>
> Throughout the course, we often present code without detailed explanations – we assume you can follow along even so. If you need help understanding code, you can often get useful explanations from AI assistant services such as [ChatGPT](https://chatgpt.com/) and [Copilot](https://copilot.microsoft.com/). For example, the following explanation of the above code was generated by ChatGPT 3.5 (and lightly edited) as a response to the prompt *Explain the following code*.
>
> “The code begins by initializing a dictionary called `char2idx` with a special character `$` mapped to the index 0. It then iterates through a collection of names, and further through each character in the name. Inside the nested loop, there is a conditional statement that checks if the character is already present in the dictionary. If it is not, the code assigns a unique index to that character. The index is determined by the current length of the dictionary, effectively assigning consecutive indices starting from 1 (since `$` already occupies index 0).”
%% Cell type:markdown id:54bfa279 tags:
#### 🎈 Task 2: A look inside the vocabulary
Write code that prints the vocabulary for the names dataset. Would you have expected this vocabulary? Would you expect the same vocabulary for a list of English names? What would you expect regarding the frequency distribution of the characters?
%% Cell type:markdown id:356723cc tags:
## Bigrams
%% Cell type:markdown id:3969b408 tags:
As already mentioned, our model is based on pairs of consecutive characters. Such pairs are called **bigrams**. For example, the bigrams in the name *anna* are *an*, *nn*, and *na*. In addition to the standard characters, we also include the start and end marker `$`.
The code in the following cell generates all bigrams from a given iterable of names:
%% Cell type:code id:2f255c4f tags:
``` python
def bigrams(names):
    for name in names:
        for x, y in zip("$" + name, name + "$"):
            yield x, y
```
%% Cell type:markdown id:c3ca4a57 tags:
For example, here are the bigrams extracted from the first two names in our dataset:
%% Cell type:code id:8433fc7a tags:
``` python
list(bigrams(names[:2]))
```
%% Cell type:markdown id:039626f7 tags:
> **🤯 Generator functions**
>
> Note that `bigrams()` is a **generator function**: It does not `return` a list of all bigrams, it `yield`s them one at a time. This is more efficient in terms of memory usage, especially when dealing with the larger datasets we will encounter in this course. If you have not worked with generators and iterators before, now is a good time to read up on them ([more information about generators](https://wiki.python.org/moin/Generators)).
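%% Cell type:markdown tags:
To make the aside above concrete, the following self-contained sketch (repeating the definition of `bigrams()` from above) materializes the generator for the single name *anna* into a list:
%% Cell type:code tags:
``` python
def bigrams(names):
    for name in names:
        for x, y in zip("$" + name, name + "$"):
            yield x, y

# Consuming the generator with list() forces all its values to be produced
print(list(bigrams(["anna"])))
# [('$', 'a'), ('a', 'n'), ('n', 'n'), ('n', 'a'), ('a', '$')]
```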
%% Cell type:markdown id:075264e3 tags:
## Estimating bigram probabilities
%% Cell type:markdown id:6b767e85 tags:
How does a language model generate text? Intuitively, we can imagine rolling a multi-sided die whose sides correspond to the elements of the vocabulary. Whereas standard dice are fair (all sides land face up with the same uniform probability), the dice of language models are weighted.
**The basic idea behind a bigram model is to let the probability of the next element in a generated sequence depend on the previous element.** One can think of the model as consisting of several differently weighted dice, one for each preceding character.
To estimate the probabilities of a bigram model, we start by counting how often each bigram occurs in the dataset. We can keep track of these counts by arranging them in a matrix $M$. More formally, let $V = \{c_0, \dots, c_{n-1}\}$ be our character vocabulary, where $c_0 = \$$. Then the matrix entry $M_{ij}$ should be the count of the character bigram $c_ic_j$ in our list of names. To compute this matrix, we can use the code in the following cell:
%% Cell type:code id:bf3e27cb tags:
``` python
import torch
# Create a counts matrix with all zeros
counts = torch.zeros(len(char2idx), len(char2idx))
# Update the counts based on the bigrams
for x, y in bigrams(names):
    counts[char2idx[x], char2idx[y]] += 1
```
%% Cell type:markdown id:6b82e3ed tags:
Note that we represent matrices as *tensors* from the [PyTorch library](https://pytorch.org/). You will learn more about this library later in the course.
Now that we have the bigram counts, we are ready to define our bigram model. This model is essentially a conditional probability distribution over all possible next characters, given the immediately preceding character. Formally, using the notation introduced above, the model consists of probabilities of the form $P(c_j\,|\,c_i)$, which quantify the probability of character $c_j$ given that the preceding character is $c_i$. To compute these probabilities, we divide each bigram count $M_{ij}$ by the sum along the row $M_{i:}$. We can accomplish this as follows:
%% Cell type:code id:a75cb80f tags:
``` python
model = counts / counts.sum(dim=-1, keepdim=True)
```
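%% Cell type:markdown tags:
After this step, every row of `model` is a probability distribution over next characters and therefore sums to 1. The same row normalization can be sketched in plain Python on a toy 2×2 count matrix (a torch-free illustration with made-up counts, not part of the model):
%% Cell type:code tags:
``` python
# Toy bigram counts: two characters, one row per preceding character
counts_toy = [[2.0, 2.0], [1.0, 3.0]]
# Divide each count by its row sum, as the torch code above does
model_toy = [[c / sum(row) for c in row] for row in counts_toy]
print(model_toy)  # [[0.5, 0.5], [0.25, 0.75]]
```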
%% Cell type:markdown id:bbe749cf tags:
#### 🎈 Task 3: Inspecting the model
In a name, which letter is most likely to come after the letter `a`? Which letter is most likely to start or end a name? Can you give an example of a name that is impossible according to our bigram model?
%% Cell type:markdown id:052d9ba1 tags:
## Generating text
%% Cell type:markdown id:5b651c2a tags:
Now that we have estimated the probabilities of our bigram model, we can generate text. To do so, we repeatedly “roll a die” by sampling from the next-character distribution, conditioning on the previous character. Equivalently, we can think of sampling from a collection of categorical distributions over the next character, one distribution per previous character.
%% Cell type:code id:51db917d tags:
``` python
# Construct the inverse of the character-to-index mapping
idx2char = {i: c for c, i in char2idx.items()}
# Generate 5 samples
for _ in range(5):
    # We begin with the start-of-sequence marker
    generated = "$"
    while True:
        # Look up the integer index of the previous character
        previous_idx = char2idx[generated[-1]]
        # Get the relevant probability distribution
        probs = model[previous_idx]
        # Sample an index from the distribution
        next_idx = torch.multinomial(probs, num_samples=1).item()
        # Get the corresponding character
        next_char = idx2char[next_idx]
        # Break if the model generates the end-of-sequence marker
        if next_char == "$":
            break
        # Add the next character to the output
        generated = generated + next_char
    # Print the generated output (without the end-of-sequence marker)
    print(generated[1:])
```
%% Cell type:markdown id:eacff22f tags:
As we can see, the strings generated by our bigram model only vaguely resemble actual names. This should not really surprise us: after all, each next character is generated by only looking at the immediately preceding character, which is too short a context to model many important aspects of names.
%% Cell type:markdown id:e0423a52 tags:
#### 🎈 Task 4: Probability of a name
What is the probability of our model generating your name? What is the probability of it generating the single-letter “name” `s`?
%% Cell type:markdown id:9c916c9f tags:
## Evaluating language models
%% Cell type:markdown id:d49f4996 tags:
Language models are commonly evaluated by computing their **perplexity**, which can be thought of as a measure of how “perplexed” the model is when being exposed to a text. The larger the perplexity on a given text, the less likely the model would have generated that text.
To compute the perplexity of our bigram model on a reference text, we first compute the average negative log likelihood that the model assigns to a gold-standard next character after having seen the previous character. To get the perplexity, we then exponentiate the result.
%% Cell type:code id:eed97cc6 tags:
``` python
import math
# Collect the negative log likelihoods
nlls = []
for prev_char, next_char in bigrams(names):
    prev_idx = char2idx[prev_char]
    next_idx = char2idx[next_char]
    nlls.append(-math.log(model[prev_idx][next_idx]))
# Compute the average negative log likelihood
avg_nll = sum(nlls) / len(nlls)
# Compute the perplexity
ppl = math.exp(avg_nll)
# Print the perplexity
print(f"{ppl:.1f}")
```
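%% Cell type:markdown tags:
To build intuition for perplexity, consider a hypothetical model that is maximally undecided and assigns probability $1/k$ to each of $k$ possible next characters. Averaging the negative log likelihoods and exponentiating then yields a perplexity of exactly $k$ (a torch-free sketch with made-up numbers):
%% Cell type:code tags:
``` python
import math

k = 4  # number of equally likely next characters (an arbitrary choice)
nlls = [-math.log(1 / k) for _ in range(100)]  # every step costs log(k)
ppl = math.exp(sum(nlls) / len(nlls))
print(round(ppl, 6))  # 4.0
```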
%% Cell type:markdown id:d3eb3029 tags:
#### 🎈 Task 5: Upper bound on perplexity
The perplexity of our bigram model can be interpreted as the average number of characters the model must choose from when trying to “guess” the held-out text. The lowest possible perplexity value is 1, which corresponds to the case where each next character is completely certain and no guessing is necessary. What is a reasonable *upper bound* on the perplexity of our bigram model?
%% Cell type:markdown id:a4cb2241 tags:
That’s all folks!