When we implemented softmax regression, we represented each sentence as a count vector. For the CBOW classifier we will use a different representation where each vector contains the word ids of the tokens in the sentence. We pad shorter sentences with zeroes.
As in the notebook on softmax regression, we first construct our vocabulary. Note that we reserve the word id 0 for padding.
%% Cell type:code id: tags:
``` python
def make_vocab(data):
    # Reserve the word id 0 for the padding marker.
    vocab = {'<pad>': 0}
    for sentence, label in data:
        for t in sentence:
            if t not in vocab:
                vocab[t] = len(vocab)
    return vocab
```
%% Cell type:markdown id: tags:
We create the vocabulary from the training data:
%% Cell type:code id: tags:
``` python
vocab = make_vocab(train_data)
```
%% Cell type:markdown id: tags:
Next we create our vector representation:
%% Cell type:code id: tags:
``` python
import torch
def vectorize(vocab, data):
    # All vectors will be right-padded up to the maximal length.
    max_length = max(len(s) for s, _ in data)
    xs = []
    ys = []
    for sentence, label in data:
        x = [0] * max_length
        for i, w in enumerate(sentence):
            # Words that are not in the vocabulary keep the padding id 0.
            if w in vocab:
                x[i] = vocab[w]
        xs.append(x)
        ys.append(label)
    return torch.LongTensor(xs), torch.LongTensor(ys)
```
%% Cell type:markdown id: tags:
We vectorize the training data and the development data:
%% Cell type:code id: tags:
``` python
train_x, train_y = vectorize(vocab, train_data)
dev_x, dev_y = vectorize(vocab, dev_data)
```
%% Cell type:markdown id: tags:
The resulting tensors are much smaller than the corresponding tensors from the notebook on softmax regression.
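As a quick sanity check, we can print the shapes of the new tensors (a minimal sketch; the exact sizes depend on the dataset):
%% Cell type:code id: tags:
``` python
# Inspect the shapes of the vectorized tensors; the concrete sizes depend on the data.
print(train_x.shape, train_y.shape)
print(dev_x.shape, dev_y.shape)
```
%% Cell type:markdown id: tags: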
Recall that our baseline (always predict the label 3) gives us an accuracy of slightly above 25%:
%% Cell type:code id: tags:
``` python
accuracy(torch.full_like(dev_y, 3), dev_y)
```
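%% Cell type:markdown id: tags:
Here, `accuracy` is assumed to be the evaluation helper used with softmax regression. A minimal sketch of what it computes:
%% Cell type:code id: tags:
``` python
# Sketch of the assumed accuracy helper: the proportion of positions where the
# predicted labels agree with the gold labels.
def accuracy(y_pred, y):
    return torch.mean(torch.eq(y_pred, y).float()).item()
```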
%% Cell type:markdown id: tags:
## Training the model
%% Cell type:markdown id: tags:
We are now ready to set up the CBOW model and train it using cross-entropy loss.
%% Cell type:code id: tags:
``` python
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
```
%% Cell type:markdown id: tags:
As with softmax regression, we will train our model using minibatch gradient descent.
%% Cell type:code id: tags:
``` python
def minibatches(x, y, batch_size):
    # Shuffle the examples and yield them in batches of the given size.
    random_indices = torch.randperm(x.size(0))
    for i in range(0, x.size(0) - batch_size + 1, batch_size):
        batch_indices = random_indices[i:i+batch_size]
        yield x[batch_indices], y[batch_indices]
```
%% Cell type:markdown id: tags:
### Model
Recall that a CBOW model consists of an embedding layer followed by an element-wise mean, a linear layer, and a final softmax. In PyTorch, embedding layers are implemented by the class [`nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html). We use the parameter `padding_idx` so that the embedding for the special padding marker `<pad>` is initialized to zeros and not updated during training.
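A minimal sketch of such a model (the class name and the hyperparameters are illustrative, not fixed by the text above):
%% Cell type:code id: tags:
``` python
class CBOW(nn.Module):
    # Sketch of a CBOW classifier: embedding -> mean over tokens -> linear layer.
    def __init__(self, num_words, embedding_dim, num_classes):
        super().__init__()
        # padding_idx=0 keeps the <pad> embedding at zero and excludes it from updates.
        self.embedding = nn.Embedding(num_words, embedding_dim, padding_idx=0)
        self.linear = nn.Linear(embedding_dim, num_classes)
    def forward(self, x):
        # x has shape (batch_size, max_length) and contains word ids.
        embedded = self.embedding(x)   # (batch_size, max_length, embedding_dim)
        pooled = embedded.mean(dim=1)  # mean over the token positions
        # Return logits; the softmax is folded into the cross-entropy loss.
        return self.linear(pooled)
```
%% Cell type:markdown id: tags: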
For training, we can reuse essentially the same training loop as for softmax regression. The only major change is the use of the Adam optimizer instead of SGD. The Adam optimizer will be our default choice in later units.
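A sketch of such a training loop (the function name, the embedding dimension, and the hyperparameter values are illustrative assumptions; the number of classes is 5, as in the fine-grained sentiment setting):
%% Cell type:code id: tags:
``` python
def train_cbow(n_epochs=10, batch_size=32, lr=1e-3):
    # Sketch of the training loop; hyperparameter values are illustrative.
    model = CBOW(len(vocab), 64, 5)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    for epoch in range(n_epochs):
        model.train()
        for bx, by in minibatches(train_x, train_y, batch_size):
            optimizer.zero_grad()
            loss = F.cross_entropy(model(bx), by)
            loss.backward()
            optimizer.step()
        # Evaluate on the development data after each epoch.
        model.eval()
        with torch.no_grad():
            dev_acc = accuracy(model(dev_x).argmax(dim=-1), dev_y)
        print(f'epoch {epoch + 1}, dev accuracy: {dev_acc:.4f}')
    return model
model = train_cbow()
```
%% Cell type:markdown id: tags: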
The accuracy of our CBOW classifier is higher than what [Socher et al., 2013](https://www.aclweb.org/anthology/D13-1170) reported. (The relevant point of comparison is the accuracy of their VecAvg method in the fine-grained setting, as reported in their Table 1.) However, the accuracy is still a bit *below* that of the classifier that we implemented in the notebook on softmax regression. On the plus side, the CBOW classifier uses much more compact vectors.
%% Cell type:markdown id: tags:
**🤔 Exploration 1: Add dropout**
> Looking at the loss curve on the development data, it seems that the model starts to overfit after only a few epochs. One way to counteract this is to regularise the model by inserting a [**dropout layer**](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html) between the mean and the final linear layer. Evaluate what effect this change has on the model accuracy. A standard dropout probability is 0.5.
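> A possible starting point (a sketch only; whether it helps is what the exploration asks you to measure):
%% Cell type:code id: tags:
``` python
class CBOWWithDropout(nn.Module):
    # Sketch: the CBOW model from above with a dropout layer inserted between
    # the mean pooling and the final linear layer.
    def __init__(self, num_words, embedding_dim, num_classes, p=0.5):
        super().__init__()
        self.embedding = nn.Embedding(num_words, embedding_dim, padding_idx=0)
        self.dropout = nn.Dropout(p)
        self.linear = nn.Linear(embedding_dim, num_classes)
    def forward(self, x):
        pooled = self.embedding(x).mean(dim=1)
        return self.linear(self.dropout(pooled))
```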
%% Cell type:markdown id: tags:
**🤔 Exploration 2: Different random seeds**
> One issue related to the evaluation of neural networks is that results can vary significantly across different random seeds. Because of this, and resources permitting, we should always train and evaluate the model with several random seeds. Produce a box plot for the accuracy of the CBOW model over 10 random seeds. How large is the variance?
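> A possible skeleton for this experiment (a sketch; it assumes the `train_cbow` function sketched above and uses matplotlib for the box plot):
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt
# Train and evaluate the model once per seed, then summarize the dev accuracies.
accuracies = []
for seed in range(10):
    torch.manual_seed(seed)
    model = train_cbow()
    model.eval()
    with torch.no_grad():
        accuracies.append(accuracy(model(dev_x).argmax(dim=-1), dev_y))
plt.boxplot(accuracies)
plt.ylabel('dev accuracy')
plt.show()
```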
%% Cell type:markdown id: tags:
**🤔 Exploration 3: Task-specific embeddings**
> As mentioned in Lecture 1.3, the word embeddings that are learned by neural networks are tuned towards the specific task that the network is trained on. Use the code provided in Lab L1 to save the embeddings learned by the CBOW classifier and explore them in the [Embedding Projector](https://projector.tensorflow.org). Look at the neighbourhoods of sentiment-laden words. Does the result match your expectations?
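> If you do not have the Lab L1 code at hand, the following sketch shows one way to export the embeddings in the tab-separated format that the Embedding Projector can load (the file names and the `model.embedding` attribute are assumptions based on the model sketched above):
%% Cell type:code id: tags:
``` python
# Write one vector per line to vectors.tsv and the corresponding word to metadata.tsv.
embeddings = model.embedding.weight.detach()
words = sorted(vocab, key=vocab.get)
with open('vectors.tsv', 'w') as f_vec, open('metadata.tsv', 'w') as f_meta:
    for word, vector in zip(words, embeddings):
        print('\t'.join(f'{x:.5f}' for x in vector.tolist()), file=f_vec)
        print(word, file=f_meta)
```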