%% Cell type:markdown id: tags:
# The CBOW classifier
%% Cell type:markdown id: tags:
In this notebook you will learn how to implement the CBOW classifier from Lecture 1.3.
%% Cell type:markdown id: tags:
## Loading the data
%% Cell type:markdown id: tags:
We use the same dataset and the same helper functions as in the notebook on softmax regression.
%% Cell type:code id: tags:
``` python
def load_data(filename, max_length=20):
    items = []
    with open(filename, 'rt', encoding='utf-8') as fp:
        for line in fp:
            sentence, label = line.rstrip().split('\t')
            items.append((sentence.split()[:max_length], int(label)))
    return items
```
%% Cell type:markdown id: tags:
Load the training data and the development data:
%% Cell type:code id: tags:
``` python
train_data = load_data('sst-5-train.txt')
dev_data = load_data('sst-5-dev.txt')
```
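%% Cell type:markdown id: tags:
As a quick sanity check (not part of the original pipeline, just illustrative), we can peek at the first training item to see the format produced by `load_data`: a list of at most 20 tokens together with an integer label.
%% Cell type:code id: tags:
``` python
# Illustrative check: each item is a (token list, label) pair.
print(train_data[0])
print('Number of training items:', len(train_data))
```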
%% Cell type:markdown id: tags:
## Vectorizing the data
%% Cell type:markdown id: tags:
When we implemented softmax regression, we represented each sentence as a count vector. For the CBOW classifier we will use a different representation where each vector contains the word ids of the tokens in the sentence. We pad shorter sentences with zeroes.
As in the notebook on softmax regression, we first construct our vocabulary. Note that we reserve the word id 0 for padding.
%% Cell type:code id: tags:
``` python
def make_vocab(data):
    vocab = {'<pad>': 0}
    for sentence, label in data:
        for t in sentence:
            if t not in vocab:
                vocab[t] = len(vocab)
    return vocab
```
%% Cell type:markdown id: tags:
We create the vocabulary from the training data:
%% Cell type:code id: tags:
``` python
vocab = make_vocab(train_data)
```
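%% Cell type:markdown id: tags:
A quick check (illustrative only): the padding marker should have received word id 0, and the vocabulary size will later determine the number of embedding vectors in our model.
%% Cell type:code id: tags:
``` python
# The padding marker keeps id 0; every other word type gets a unique id.
print('Id of <pad>:', vocab['<pad>'])
print('Vocabulary size:', len(vocab))
```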
%% Cell type:markdown id: tags:
Next we create our vector representation:
%% Cell type:code id: tags:
``` python
import torch

def vectorize(vocab, data):
    # All vectors will be right-padded up to the maximal length.
    max_length = max(len(s) for s, _ in data)
    xs = []
    ys = []
    for sentence, label in data:
        x = [0] * max_length
        for i, w in enumerate(sentence):
            if w in vocab:
                x[i] = vocab[w]
        xs.append(x)
        ys.append(label)
    return torch.LongTensor(xs), torch.LongTensor(ys)
```
%% Cell type:markdown id: tags:
We vectorize the training data and the development data:
%% Cell type:code id: tags:
``` python
train_x, train_y = vectorize(vocab, train_data)
dev_x, dev_y = vectorize(vocab, dev_data)
```
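%% Cell type:markdown id: tags:
To see what the new representation looks like, we can print the vector for the first training sentence (a quick illustration; the trailing zeroes are padding ids):
%% Cell type:code id: tags:
``` python
# Word ids of the first training sentence, right-padded with zeroes, and its label.
print(train_x[0])
print('Label:', train_y[0].item())
```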
%% Cell type:markdown id: tags:
The resulting tensors are much smaller than the corresponding tensors from the notebook on softmax regression:
%% Cell type:code id: tags:
``` python
print('Training data:', train_x.shape, train_y.shape)
print('Development data:', dev_x.shape, dev_y.shape)
```
%% Output
Training data: torch.Size([8544, 20]) torch.Size([8544])
Development data: torch.Size([1101, 20]) torch.Size([1101])
%% Cell type:markdown id: tags:
**🤔 Problem 1: Padding**
> Why does the `vectorize` function pad all vectors to the same length? What would happen if it did not do so?
%% Cell type:markdown id: tags:
## Evaluation
%% Cell type:markdown id: tags:
As in the case of softmax regression, we evaluate our classifier using accuracy:
%% Cell type:code id: tags:
``` python
def accuracy(y_pred, y):
    return torch.mean(torch.eq(y_pred, y).float()).item()
```
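%% Cell type:markdown id: tags:
On a small made-up example (for illustration only), `accuracy` compares the two label tensors element-wise and returns the fraction of positions where they agree:
%% Cell type:code id: tags:
``` python
# Two of the three predictions match the gold labels, so the accuracy is 2/3.
accuracy(torch.tensor([1, 2, 3]), torch.tensor([1, 2, 0]))
```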
%% Cell type:markdown id: tags:
Recall that our baseline (always predict the label&nbsp;3) gives us an accuracy of slightly above 25%:
%% Cell type:code id: tags:
``` python
accuracy(torch.full_like(dev_y, 3), dev_y)
```
%% Output
0.25340598821640015
%% Cell type:markdown id: tags:
## Training the model
%% Cell type:markdown id: tags:
We are now ready to set up the CBOW model and train it using cross-entropy loss.
%% Cell type:code id: tags:
``` python
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
```
%% Cell type:markdown id: tags:
As with softmax regression, we will train our model using minibatch gradient descent.
%% Cell type:code id: tags:
``` python
def minibatches(x, y, batch_size):
    random_indices = torch.randperm(x.size(0))
    for i in range(0, x.size(0) - batch_size + 1, batch_size):
        batch_indices = random_indices[i:i+batch_size]
        yield x[batch_indices], y[batch_indices]
```
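%% Cell type:markdown id: tags:
As a quick illustration (assuming the vectorized data from above), each minibatch is a pair of tensors with `batch_size` rows, drawn in a fresh random order every time the generator is created; note that a final partial batch is dropped.
%% Cell type:code id: tags:
``` python
# Peek at the first minibatch; the shapes are (batch_size, max_length) and (batch_size,).
bx, by = next(minibatches(train_x, train_y, 24))
print(bx.shape, by.shape)
```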
%% Cell type:markdown id: tags:
### Model
Recall that a CBOW model consists of an embedding layer followed by an element-wise mean, a linear layer, and a final softmax. In PyTorch, embedding layers are implemented by the class [`nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html). We use the parameter `padding_idx` to ensure that the embedding for the special padding marker `<pad>` is initialized to a zero vector and is not updated during training.
%% Cell type:code id: tags:
``` python
class CBOW(nn.Module):

    def __init__(self, num_embeddings, embedding_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim, padding_idx=0)
        self.linear = nn.Linear(embedding_dim, num_classes)

    def forward(self, x):
        # 1. Note that the vector mean is computed per sentence (axis -2)
        # 2. Note that there is no softmax; it will be implicit in the loss function.
        return self.linear(torch.mean(self.embedding(x), dim=-2))
```
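%% Cell type:markdown id: tags:
As a quick check of the `padding_idx` behaviour (illustrative; the model used below is created inside `train`), the embedding row reserved for `<pad>` starts out as a zero vector:
%% Cell type:code id: tags:
``` python
# The embedding vector at index 0 (reserved for <pad>) is all zeroes and receives no gradient.
tmp_model = CBOW(len(vocab), 30, 5)
print(tmp_model.embedding.weight[0])
```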
%% Cell type:markdown id: tags:
**🤔 Problem 2: Shapes**
> When writing PyTorch code, it is often very useful to annotate tensor-valued variables with their shapes. What is the shape of the variable `x` when it is passed as an argument to `forward`? What is the shape of the output of each of the operations in the `forward` method?
%% Cell type:markdown id: tags:
### Training loop
For training, we can reuse essentially the same training loop as for softmax regression. The only major change is the use of the Adam optimizer instead of SGD. The Adam optimizer will be our default choice in later units.
%% Cell type:code id: tags:
``` python
def train(n_epochs=10, batch_size=24, lr=1e-2):
    # Initialize the model
    model = CBOW(len(vocab), 30, 5)
    # Initialize the optimizer
    optimizer = optim.Adam(model.parameters(), lr=lr)
    # Training loop proper
    for t in range(n_epochs):
        for bx, by in minibatches(train_x, train_y, batch_size):
            optimizer.zero_grad()
            output = model.forward(bx)
            loss = F.cross_entropy(output, by)
            loss.backward()
            optimizer.step()
    # Return the trained model
    return model
```
%% Cell type:markdown id: tags:
Here is the embellished version of the training loop, including evaluation on the development data.
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt
import tqdm

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

def train(n_epochs=10, batch_size=24, lr=1e-2):
    # Initialize the model
    model = CBOW(len(vocab), 30, 5)
    # Initialize the optimizer
    optimizer = optim.Adam(model.parameters(), lr=lr)
    # We will keep track of the losses on the two datasets
    train_losses = []
    dev_losses = []
    dev_accuracies = []
    info = {'dev loss': 0, 'dev acc': 0}
    with tqdm.tqdm(total=n_epochs) as pbar:
        for t in range(n_epochs):
            pbar.set_description(f'Epoch {t+1}')
            # Start training
            model.train()
            running_loss = 0
            for bx, by in minibatches(train_x, train_y, batch_size):
                optimizer.zero_grad()
                output = model.forward(bx)
                loss = F.cross_entropy(output, by)
                loss.backward()
                optimizer.step()
                running_loss += loss.item() * len(bx)
            train_losses.append(running_loss / len(train_x))
            # Start evaluation
            model.eval()
            with torch.no_grad():
                dev_output = model.forward(dev_x)
                dev_loss = F.cross_entropy(dev_output, dev_y)
                dev_losses.append(dev_loss.item())
                dev_y_pred = torch.argmax(dev_output, dim=1)
                dev_acc = accuracy(dev_y_pred, dev_y)
                dev_accuracies.append(dev_acc)
            info['dev loss'] = f'{dev_loss:.4f}'
            info['dev acc'] = f'{dev_acc:.4f}'
            pbar.set_postfix(info)
            pbar.update()
    # Plotting
    plt.figure(figsize=(15, 6))
    plt.subplot(121)
    plt.plot(train_losses)
    plt.plot(dev_losses)
    plt.xlabel('Epoch')
    plt.ylabel('Average loss')
    plt.subplot(122)
    plt.plot(dev_accuracies)
    plt.xlabel('Epoch')
    plt.ylabel('Development set accuracy')
    return model
```
%% Cell type:markdown id: tags:
Ready to train!
%% Cell type:code id: tags:
``` python
train()
```
%% Output
Epoch 10: 100%|█| 10/10 [00:14<00:00, 1.46s/it, dev loss=2.3428, dev acc=0.3470
CBOW(
  (embedding): Embedding(13690, 30, padding_idx=0)
  (linear): Linear(in_features=30, out_features=5, bias=True)
)
%% Cell type:markdown id: tags:
The accuracy of our CBOW classifier is higher than what [Socher et al., 2013](https://www.aclweb.org/anthology/D13-1170) reported. (The relevant point of comparison is the accuracy of their VecAvg method in the fine-grained setting, as reported in their Table&nbsp;1.) However, the accuracy is still a bit *below* that of the classifier that we implemented in the notebook on softmax regression. On the plus side, the CBOW classifier uses much more compact vectors.
%% Cell type:markdown id: tags:
**🤔 Exploration 1: Dropout**
> Looking at the loss curve on the development data, it seems that the model starts to overfit after only a few epochs. One way to counteract this is to regularise the model by inserting a [**dropout layer**](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html) between the mean and the final linear layer. Evaluate what effect this change has on the model accuracy. A standard dropout probability is 0.5.
%% Cell type:markdown id: tags:
**🤔 Exploration 2: Early stopping**
> Another way to deal with overfitting is to stop the training early when the model accuracy no longer increases on the development (validation) data. Implement this strategy.
%% Cell type:markdown id: tags:
**🤔 Exploration 3: Different random seeds**
> One issue related to the evaluation of neural networks is that results can vary significantly across different random seeds. Because of this, and resources permitting, we should always train and evaluate the model with several random seeds. Produce a box plot for the accuracy of the CBOW model over 10 random seeds. How large is the variance?
%% Cell type:markdown id: tags:
**🤔 Exploration 4: Task-specific embeddings**
> As mentioned in Lecture&nbsp;1.3, the word embeddings that are learned by neural networks are tuned towards the specific task that the network is trained on. Use the code provided in Lab&nbsp;L1 to save the embeddings learned by the CBOW classifier and explore them in the [Embedding Projector](https://projector.tensorflow.org). Look at the neighbourhoods of sentiment-laden words. Does the result match your expectations?
%% Cell type:markdown id: tags:
That’s all folks!