⚠️ The dataset for this lab contains 18M tokens. This is very little as far as word embedding datasets are concerned – for example, the original word2vec model was pre-trained on 100B tokens. In spite of this, you will need to think about efficiency when processing the data and training your models. In particular, wherever possible you should use iterators rather than lists, and vectorize operations (using [NumPy](https://numpy.org) or [PyTorch](https://pytorch.org)) as much as possible.
The data for this lab comes as a bz2-compressed plain text file. It consists of 1,163,769 sentences, with one sentence per line and tokens separated by spaces. The cell below contains a wrapper class `SimpleWikiDataset` that can be used to iterate over the sentences (lines) in the text file. On the Python side of things, each sentence is represented as a list of tokens (strings).
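The actual wrapper class is provided in the cell below; purely for orientation, a wrapper of this kind might look roughly like the following sketch. The class name and the file name `simplewiki.txt.bz2` are assumptions, and the sketch is not a substitute for the provided `SimpleWikiDataset`.

```python
import bz2

class SentenceIterator:
    """Illustrative stand-in for the provided SimpleWikiDataset class."""

    def __init__(self, filename="simplewiki.txt.bz2"):   # file name is an assumption
        self.filename = filename

    def __iter__(self):
        # Stream the compressed file line by line; each line is one sentence
        # with tokens separated by spaces.
        with bz2.open(self.filename, mode="rt", encoding="utf-8") as source:
            for line in source:
                yield line.split()
```

Because `__iter__` is a generator, the wrapper can be iterated over repeatedly without ever holding the full dataset in memory.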
```
```
```
```
```
> Reads from an iterable of *sentences* (lists of string tokens) and returns a pair *vocab*, *counts*, where *vocab* is a dictionary representing the vocabulary and *counts* is a 1D-[ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html) with the absolute frequencies (counts) of the words in the vocabulary. The dictionary *vocab* maps words to a contiguous range of integers starting at 0. In the *counts* array, the entry at index $i$ is the count of the word that *vocab* maps to $i$. Words that occur fewer than *min_count* times are excluded from the vocabulary.
To test your code, print the sizes of the vocabularies constructed from the two datasets, as well as the count totals. The correct vocabulary size for the minimal dataset is 3,231; for the full dataset, the correct vocabulary size is 73,339. The correct totals are 155,818 for the minimal dataset and 17,297,355 for the full dataset.
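A minimal sketch of one way to implement such a function is given below; the function name `make_vocab` and the `min_count` default are assumptions, and the skeleton in the notebook is authoritative.

```python
import numpy as np
from collections import Counter

def make_vocab(sentences, min_count=5):   # name and default are assumptions
    # Count all tokens in a single streaming pass over the sentences.
    counter = Counter(token for sentence in sentences for token in sentence)
    vocab, counts = {}, []
    for word, count in counter.items():
        if count >= min_count:
            vocab[word] = len(vocab)      # contiguous ids starting at 0
            counts.append(count)
    return vocab, np.array(counts)
```

Feeding a generator expression to `Counter` keeps the pass over the data streaming, in line with the efficiency advice above.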
```
```
> Reads from an iterable of *sentences* (lists of string tokens) and yields the preprocessed sentences as non-empty lists of word ids (integers). Words not in *vocab* are discarded. The remaining words are randomly discarded according to the subsampling strategy with the given *threshold*. In the non-empty sentences, each token is replaced by its id in the vocabulary.
**⚠️ Please observe** that your function should *yield* the preprocessed sentences, not return a list with all of them. That is, we ask you to write a *generator function*. If you have not worked with generators and iterators before, now is a good time to read up on them. [More information about generators](https://wiki.python.org/moin/Generators)
Test your code by comparing the total number of tokens in the preprocessed version of each dataset with the corresponding number for the original data. The former should be ca. 59% of the latter for the minimal dataset, and ca. 69% for the full dataset. The exact percentage will vary slightly because of the randomness in the sampling. You may want to repeat your computation several times.
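The sketch below shows one possible shape for this generator. The function name and signature are assumptions, and the keep probability follows the subsampling heuristic from the original word2vec paper, $\sqrt{t/f(w)}$ capped at 1, where $f(w)$ is the relative frequency of $w$ and $t$ is the threshold; the lab's own definition of the subsampling strategy takes precedence.

```python
import random

def preprocess(sentences, vocab, counts, threshold=0.001):   # signature is an assumption
    total = counts.sum()
    # Keep probability per word id: sqrt(threshold / relative frequency), capped at 1.
    keep_prob = {i: min(1.0, (threshold * total / counts[i]) ** 0.5)
                 for i in range(len(counts))}
    for sentence in sentences:
        ids = [vocab[w] for w in sentence if w in vocab]
        ids = [i for i in ids if random.random() < keep_prob[i]]
        if ids:
            yield ids
```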
```
The general plan for solving this problem is to implement a generator function that traverses the preprocessed sentences, samples a window at each position of the text, and extracts all positive examples from that window. For each positive example, the function also generates $k$ negative examples, where $k$ is a hyperparameter. Finally, all examples (positive and negative) are combined into the tensor representation described below.
How should you represent a batch of training examples? Writing $B$ for the batch size, the obvious choice would be to represent the inputs as a matrix of shape $[B, 2]$ and the output labels (positive/negative) as a vector of length $B$. This representation would be quite wasteful on the input side, however, as each target word (index) from a positive example would have to be repeated in all negative samples. For example ($k=3$): a single positive example with target word $w$ and context word $c^{+}$, together with negative samples $c^{-}_1, c^{-}_2, c^{-}_3$, would give rise to the four input rows $(w, c^{+})$, $(w, c^{-}_1)$, $(w, c^{-}_2)$, $(w, c^{-}_3)$ and the label vector $(1, 0, 0, 0)$, with the target word $w$ repeated in every row.
Here you will use a different representation: First, instead of a single input batch, there will be a *pair* of input batches – a vector for the target words and a matrix for the context words. If the target word vector has length $B$, the context word matrix has shape $[B, 1+k]$. The $i$th element of the target word vector is the target word for *all* context words in the $i$th row of the context word matrix: the first column of that row comes from a positive example, the remaining columns come from the $k$ negative samples. Accordingly, the batch with the output labels will be a matrix of the same shape as the context word matrix, with its first column set to 1 and its remaining columns set to 0. Corresponding to the example above: the target word vector contains the single entry $w$, the corresponding row of the context word matrix is $(c^{+}, c^{-}_1, c^{-}_2, c^{-}_3)$, and the corresponding row of the label matrix is $(1, 0, 0, 0)$.
To implement negative sampling from this distribution (in which the probability of a context word is proportional to its count raised to the power $\alpha$, the negative-sampling exponent), you can follow a standard recipe: Start by pre-computing an array containing the *cumulative sums* of the exponentiated counts. Then, draw a random cumulative count $n$ uniformly between 0 and the largest (last) entry of the array, and find the index at which $n$ would have to be inserted to keep the array sorted. That index identifies the sampled context word.
All operations in this recipe can be implemented efficiently in PyTorch; the relevant functions are [`torch.cumsum`](https://pytorch.org/docs/stable/generated/torch.cumsum.html) and [`torch.searchsorted`](https://pytorch.org/docs/stable/generated/torch.searchsorted.html). For optimal efficiency, you should sample all $B \times k$ negative examples in a batch at once.
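As an illustration of this recipe, a sampler along the following lines could be built once from the counts array; the function name and the `ns_exponent` default are assumptions.

```python
import torch

def make_sampler(counts, ns_exponent=0.75):   # name and default are assumptions
    # Cumulative sums of the counts raised to the power alpha.
    cumulative = torch.cumsum(torch.tensor(counts, dtype=torch.double) ** ns_exponent, dim=0)
    total = cumulative[-1]

    def sample(shape):
        # Draw uniform random "cumulative counts" in [0, total) and find the
        # indices at which they would have to be inserted to keep the array sorted.
        random_counts = torch.rand(shape, dtype=torch.double) * total
        return torch.searchsorted(cumulative, random_counts)

    return sample
```

Calling `sample((B, k))` then returns a `[B, k]` tensor of sampled word ids in a single vectorized operation.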
```
> Reads from an iterable of *sentences* (lists of string tokens), preprocesses them using the function implemented in Problem 2, and then yields pairs of input batches for gradient-based training, represented as described above. Each batch contains *batch_size* positive examples. The parameter *window* specifies the maximal distance between a target word and a context word in a positive example; the actual window size around any given target word is sampled uniformly at random. The parameter *num_ns* specifies the number of negative samples per positive sample. The parameter *ns_exponent* specifies the exponent in the negative sampling (called $\alpha$ above).
To test your code, compare the total number of positive samples (across all batches) to the total number of tokens in the (un-preprocessed) minimal dataset. The ratio between these two values should be ca. 2.64. If you can spare the time, you can make the same comparison on the full dataset; here, the expected ratio is 3.25. As before, the numbers may vary slightly because of randomness, so you may want to run the comparison more than once.
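For orientation, the overall generator could be organised roughly as in the sketch below, reusing the hypothetical `preprocess` and `make_sampler` helpers sketched earlier; the signature and defaults are assumptions. The label matrix is not yielded here, since it is fully determined by the batch shape (first column 1, remaining columns 0) and can be constructed in the training loop.

```python
import random
import torch

def training_batches(sentences, vocab, counts,
                     batch_size=10000, window=5, num_ns=5, ns_exponent=0.75):
    # Names and defaults are assumptions; the notebook fixes the real signature.
    sample_negatives = make_sampler(counts, ns_exponent)
    targets, contexts = [], []
    for sentence in preprocess(sentences, vocab, counts):
        for i, target in enumerate(sentence):
            width = random.randint(1, window)   # actual window size at this position
            for j in range(max(0, i - width), min(len(sentence), i + width + 1)):
                if j == i:
                    continue
                targets.append(target)
                contexts.append(sentence[j])
                if len(targets) == batch_size:
                    w = torch.tensor(targets)                        # shape [B]
                    c = torch.empty(batch_size, 1 + num_ns, dtype=torch.long)
                    c[:, 0] = torch.tensor(contexts)                 # positive contexts
                    c[:, 1:] = sample_negatives((batch_size, num_ns))
                    yield w, c
                    targets, contexts = [], []
```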
```
Now it is time to implement the skip-gram model as such. The cell below contains skeleton code for this. As you will recall from Lecture 1.4, the core of the implementation is formed by two embedding layers: one for the target word representations, and one for the context word representations. Your task is to implement the missing `forward()` method.
```
> The input to this method is a tensor *w* with target words of shape $[B]$ and a tensor *c* with context words of shape $[B, 1+k]$, where $B$ is the batch size and $k$ is the number of negative samples. The two tensors are structured as explained for Problem 3. The output of the method is a tensor $D$ of shape $[B, 1+k]$ where entry $D_{ij}$ is the dot product between the embedding vector for the $i$th target word and the embedding vector for the context word in row $i$, column $j$.
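A possible implementation of the two embedding layers and the batched dot products is sketched below; the class and attribute names are assumptions, and the notebook's skeleton code is authoritative.

```python
import torch
import torch.nn as nn

class SkipGramModel(nn.Module):
    # Illustrative sketch; names are assumptions.

    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.target = nn.Embedding(vocab_size, embedding_dim)    # target word vectors
        self.context = nn.Embedding(vocab_size, embedding_dim)   # context word vectors

    def forward(self, w, c):
        t = self.target(w)    # [B, D]
        u = self.context(c)   # [B, 1+k, D]
        # Batched dot products: [B, 1+k, D] x [B, D, 1] -> [B, 1+k, 1] -> [B, 1+k]
        return torch.bmm(u, t.unsqueeze(-1)).squeeze(-1)
```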
```
```
First, instead of categorical cross entropy, you will use binary cross entropy. Just like the standard implementation of the softmax classifier, the skip-gram model does not include a final non-linearity, so you should use [`binary_cross_entropy_with_logits()`](https://pytorch.org/docs/1.9.1/generated/torch.nn.functional.binary_cross_entropy_with_logits.html).
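For concreteness, one pass over the training batches could look roughly as follows, building the label matrix with its first column set to 1 as described in Problem 3. The model and batch generator follow the earlier sketches, and the optimizer choice and embedding dimension are assumptions.

```python
import torch
import torch.nn.functional as F

model = SkipGramModel(len(vocab), embedding_dim=100)   # dimension is an assumption
optimizer = torch.optim.Adam(model.parameters())

for w, c in training_batches(dataset, vocab, counts):
    optimizer.zero_grad()
    logits = model(w, c)                      # [B, 1+k] dot products
    labels = torch.zeros_like(logits)         # first column positive, rest negative
    labels[:, 0] = 1.0
    loss = F.binary_cross_entropy_with_logits(logits, labels)
    loss.backward()
    optimizer.step()
```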
```
```
Training on the full dataset will take some time – on a CPU, you should expect 10–40 minutes per epoch, depending on hardware. To give you some guidance: The total number of positive examples is approximately 58M, and the batch size is chosen so that each batch contains roughly 10% of these examples. To speed things up, you can train using a GPU; our reference implementation runs in less than 2 minutes per epoch on [Colab](http://colab.research.google.com).
```
Now that you have a trained model, you will probably be curious to see what it has learned. You can inspect your embeddings using the [Embedding Projector](http://projector.tensorflow.org). To that end, click on the ‘Load’ button, which will open up a dialogue with instructions for how to upload embeddings from your computer.
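One way to export your embeddings for the Embedding Projector is to write two tab-separated files, one with the vectors and one with the corresponding words as metadata, and upload both through the ‘Load’ dialogue. The file names and the `target` attribute below follow the earlier sketches and are assumptions.

```python
def save_for_projector(model, vocab, vectors_file="vectors.tsv", metadata_file="metadata.tsv"):
    # File names and the 'target' attribute are assumptions.
    weights = model.target.weight.detach().cpu()
    with open(vectors_file, "w", encoding="utf-8") as vectors, \
         open(metadata_file, "w", encoding="utf-8") as metadata:
        for word, idx in vocab.items():
            vectors.write("\t".join(f"{x:.5f}" for x in weights[idx].tolist()) + "\n")
            metadata.write(word + "\n")
```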
```