Commit a572b5ef authored by Marco Kuhlmann

Fix a hyperparameter

parent a162320c
%% Cell type:markdown id: tags:
# The CBOW classifier
%% Cell type:markdown id: tags:
In this notebook you will learn how to implement the CBOW classifier from Lecture 1.3.
%% Cell type:markdown id: tags:
## Loading the data
%% Cell type:markdown id: tags:
We use the same dataset and the same helper functions as in the notebook on softmax regression.
%% Cell type:code id: tags:
``` python
def load_data(filename, max_length=20):
    items = []
    with open(filename, 'rt', encoding='utf-8') as fp:
        for line in fp:
            sentence, label = line.rstrip().split('\t')
            items.append((sentence.split()[:max_length], int(label)))
    return items
```
%% Cell type:markdown id: tags:
Load the training data and the development data:
%% Cell type:code id: tags:
``` python
train_data = load_data('sst-5-train.txt')
dev_data = load_data('sst-5-dev.txt')
```
%% Cell type:markdown id: tags:
## Vectorizing the data
%% Cell type:markdown id: tags:
When we implemented softmax regression, we represented each sentence as a count vector. For the CBOW classifier we will use a different representation where each vector contains the word ids of the tokens in the sentence. We pad shorter sentences with zeroes.
As in the notebook on softmax regression, we first construct our vocabulary. Note that we reserve the word id 0 for padding.
%% Cell type:code id: tags:
``` python
def make_vocab(data):
    vocab = {'<pad>': 0}
    for sentence, label in data:
        for t in sentence:
            if t not in vocab:
                vocab[t] = len(vocab)
    return vocab
```
%% Cell type:markdown id: tags:
We create the vocabulary from the training data:
%% Cell type:code id: tags:
``` python
vocab = make_vocab(train_data)
```
%% Cell type:markdown id: tags:
Next we create our vector representation:
%% Cell type:code id: tags:
``` python
import torch

def vectorize(vocab, data):
    # All vectors will be right-padded up to the maximal length.
    max_length = max(len(s) for s, _ in data)
    xs = []
    ys = []
    for sentence, label in data:
        x = [0] * max_length
        for i, w in enumerate(sentence):
            if w in vocab:
                x[i] = vocab[w]
        xs.append(x)
        ys.append(label)
    return torch.LongTensor(xs), torch.LongTensor(ys)
```
%% Cell type:markdown id: tags:
We vectorize the training data and the development data:
%% Cell type:code id: tags:
``` python
train_x, train_y = vectorize(vocab, train_data)
dev_x, dev_y = vectorize(vocab, dev_data)
```
%% Cell type:markdown id: tags:
The resulting tensors are much smaller than the corresponding tensors from the notebook on softmax regression:
%% Cell type:code id: tags:
``` python
print('Training data:', train_x.size(), train_y.size())
print('Development data:', dev_x.size(), dev_y.size())
```
%% Cell type:markdown id: tags:
## Evaluation
%% Cell type:markdown id: tags:
As in the case of softmax regression, we evaluate our classifier using accuracy:
%% Cell type:code id: tags:
``` python
def accuracy(y_pred, y):
    return torch.mean(torch.eq(y_pred, y).float()).item()
```
%% Cell type:markdown id: tags:
Recall that our baseline (always predict the label&nbsp;3) gives us an accuracy of slightly above 25%:
%% Cell type:code id: tags:
``` python
accuracy(torch.full_like(dev_y, 3), dev_y)
```
%% Cell type:markdown id: tags:
## Training the model
%% Cell type:markdown id: tags:
We are now ready to set up the CBOW model and train it using cross-entropy loss.
%% Cell type:code id: tags:
``` python
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
```
%% Cell type:markdown id: tags:
As with softmax regression, we will train our model using minibatch gradient descent.
%% Cell type:code id: tags:
``` python
def minibatches(x, y, batch_size):
    random_indices = torch.randperm(x.size(0))
    for i in range(0, x.size(0) - batch_size + 1, batch_size):
        batch_indices = random_indices[i:i+batch_size]
        yield x[batch_indices], y[batch_indices]
```
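%% Cell type:markdown id: tags:
To see what the generator yields, it can help to peek at the first minibatch and check the tensor shapes. This is purely an illustrative check, assuming the cells above have been run:
%% Cell type:code id: tags:
``` python
# Draw one minibatch: word ids have shape (batch_size, max_length),
# labels have shape (batch_size,).
bx, by = next(minibatches(train_x, train_y, 24))
print(bx.size(), by.size())
```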
%% Cell type:markdown id: tags:
### Model
Recall that a CBOW model consists of an embedding layer followed by an element-wise mean, a linear layer, and a final softmax. In PyTorch, embedding layers are implemented by the class [`nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html). We use the parameter `padding_idx` so that the embedding for the special padding marker `<pad>` is initialized to the zero vector and excluded from gradient updates during training.
%% Cell type:code id: tags:
``` python
class CBOW(nn.Module):
    def __init__(self, num_embeddings, embedding_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim, padding_idx=0)
        self.linear = nn.Linear(embedding_dim, num_classes)

    def forward(self, x):
        # Note that the vector mean is computed per sentence (axis -2)
        return self.linear(torch.mean(self.embedding(x), dim=-2))
```
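%% Cell type:markdown id: tags:
As a quick sanity check on the tensor shapes (purely illustrative; the name `toy_model` is made up, and the cells above are assumed to have been run), a freshly initialized model should map a batch of padded word-id vectors to one row of class scores per sentence:
%% Cell type:code id: tags:
``` python
# Forward a small batch through an untrained model and check that the
# output has shape (batch_size, num_classes).
toy_model = CBOW(len(vocab), 30, 5)
with torch.no_grad():
    print(toy_model(train_x[:4]).size())   # expected: torch.Size([4, 5])
```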
%% Cell type:markdown id: tags:
### Training loop
For training, we can reuse essentially the same training loop as for softmax regression. The only major change is the use of the Adam optimizer instead of SGD. The Adam optimizer will be our default choice in later units.
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt
import tqdm

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

def train(n_epochs=10, batch_size=24, lr=1e-2, stop_early=False):
    # Initialize the model
    model = CBOW(len(vocab), 30, 5)
    # Initialize the optimizer
    optimizer = optim.Adam(model.parameters(), lr=lr)
    # We will keep track of the losses on the two datasets
    train_losses = []
    dev_losses = []
    dev_accuracies = []
    info = {'dev loss': 0, 'dev acc': 0}
    with tqdm.tqdm(total=n_epochs) as pbar:
        for t in range(n_epochs):
            pbar.set_description(f'Epoch {t+1}')
            # Start training
            model.train()
            running_loss = 0
            for bx, by in minibatches(train_x, train_y, batch_size):
                optimizer.zero_grad()
                output = model.forward(bx)
                loss = F.cross_entropy(output, by)
                loss.backward()
                optimizer.step()
                running_loss += loss.item() * len(bx)
            train_losses.append(running_loss / len(train_x))
            # Start evaluation
            model.eval()
            with torch.no_grad():
                dev_output = model.forward(dev_x)
                dev_loss = F.cross_entropy(dev_output, dev_y)
                dev_losses.append(dev_loss.item())
                dev_y_pred = torch.argmax(dev_output, axis=1)
                dev_acc = accuracy(dev_y_pred, dev_y)
                info['dev loss'] = f'{dev_loss:.4f}'
                info['dev acc'] = f'{dev_acc:.4f}'
                pbar.set_postfix(info)
                pbar.update()
                # Stop early if accuracy has not improved
                if stop_early and dev_accuracies and dev_accuracies[-1] > dev_acc:
                    break
                else:
                    dev_accuracies.append(dev_acc)
    # Plotting
    plt.figure(figsize=(15, 6))
    plt.subplot(121)
    plt.plot(train_losses)
    plt.plot(dev_losses)
    plt.xlabel('Epoch')
    plt.ylabel('Average loss')
    plt.subplot(122)
    plt.plot(dev_accuracies)
    plt.xlabel('Epoch')
    plt.ylabel('Development set accuracy')
    return model
```
%% Cell type:markdown id: tags:
Ready to train!
%% Cell type:code id: tags:
``` python
train()
```
%% Cell type:markdown id: tags:
The accuracy of our CBOW classifier is higher than what [Socher et al., 2013](https://www.aclweb.org/anthology/D13-1170) reported. (The relevant point of comparison is the accuracy of their VecAvg method in the fine-grained setting, as reported in their Table&nbsp;1.) However, the accuracy is still a bit *below* that of the classifier that we implemented in the notebook on softmax regression. On the plus side, the CBOW classifier uses much more compact vectors.
%% Cell type:markdown id: tags:
**🤔 Exploration 1: Add dropout**
> Looking at the loss curve on the development data, it seems that the model starts to overfit after only a few epochs. One way to counteract this is to regularise the model by inserting a [**dropout layer**](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html) between the mean and the final linear layer. Evaluate what effect this change has on the model's accuracy. A standard dropout probability is 0.5.
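%% Cell type:markdown id: tags:
> A minimal sketch of what such a change could look like is shown below. This is one possible solution rather than a reference implementation; the class name `CBOWDropout` and the `p` argument are only illustrative.
%% Cell type:code id: tags:
``` python
class CBOWDropout(nn.Module):
    def __init__(self, num_embeddings, embedding_dim, num_classes, p=0.5):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim, padding_idx=0)
        # Dropout sits between the mean of the embeddings and the linear layer.
        self.dropout = nn.Dropout(p)
        self.linear = nn.Linear(embedding_dim, num_classes)

    def forward(self, x):
        return self.linear(self.dropout(torch.mean(self.embedding(x), dim=-2)))
```
%% Cell type:markdown id: tags:
> Note that the training loop above already calls `model.train()` and `model.eval()`, so dropout would automatically be active during training and switched off during evaluation.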
%% Cell type:markdown id: tags:
**🤔 Exploration 2: Different random seeds**
> One issue related to the evaluation of neural networks is that results can vary significantly across different random seeds. Because of this, and resources permitting, we should always train and evaluate the model with several random seeds. Produce a box plot for the accuracy of the CBOW model over 10 random seeds. How large is the variance?
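%% Cell type:markdown id: tags:
> One possible sketch for this experiment is given below. It assumes the cells above have been run; the helper name `dev_accuracy_for_seed` is made up for illustration, and the final development accuracy is recomputed from the model returned by `train()`. (Each call to `train()` will also produce its usual plots.)
%% Cell type:code id: tags:
``` python
def dev_accuracy_for_seed(seed):
    # Fix the seed that governs parameter initialisation and minibatch shuffling.
    torch.manual_seed(seed)
    model = train()
    model.eval()
    with torch.no_grad():
        y_pred = torch.argmax(model(dev_x), dim=1)
    return accuracy(y_pred, dev_y)

accuracies = [dev_accuracy_for_seed(seed) for seed in range(10)]
plt.figure()
plt.boxplot(accuracies)
plt.ylabel('Development set accuracy')
```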
%% Cell type:markdown id: tags:
**🤔 Exploration 3: Task-specific embeddings**
> As mentioned in Lecture&nbsp;1.3, the word embeddings that are learned by neural networks are tuned towards the specific task that the network is trained on. Use the code provided in Lab&nbsp;L1 to save the embeddings learned by the CBOW classifier and explore them in the [Embedding Projector](https://projector.tensorflow.org). Look at the neighbourhoods of sentiment-laden words. Does the result match your expectations?
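%% Cell type:markdown id: tags:
> If you do not have the Lab&nbsp;L1 code at hand, the sketch below shows one possible way to export the embeddings. The Embedding Projector can load two tab-separated files, one with the vectors and one with the corresponding words; the file names used here are arbitrary.
%% Cell type:code id: tags:
``` python
model = train()   # or reuse a model you have already trained

with open('vectors.tsv', 'wt', encoding='utf-8') as vec_fp, \
     open('metadata.tsv', 'wt', encoding='utf-8') as meta_fp:
    weights = model.embedding.weight.detach()
    for word, idx in vocab.items():
        # One embedding vector per line, tab-separated; words in the same order.
        print('\t'.join(f'{x:.6f}' for x in weights[idx].tolist()), file=vec_fp)
        print(word, file=meta_fp)
```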
%% Cell type:markdown id: tags:
That’s all folks!