⚠️ The dataset for this lab contains 18M tokens. This is very little as far as word embedding datasets are concerned – for example, the original word2vec model was pre-trained on 100B tokens. In spite of this, you will need to think about efficiency when processing the data and training your models. In particular, wherever possible you should use iterators rather than lists, and vectorize operations (using [NumPy](https://numpy.org) or [PyTorch](https://pytorch.org)) as much as possible.
The data for this lab comes as a bz2-compressed plain text file. It consists of 1,163,769 sentences, with one sentence per line and tokens separated by spaces. The cell below contains a wrapper class `SimpleWikiDataset` that can be used to iterate over the sentences (lines) in the text file. On the Python side of things, each sentence is represented as a list of tokens (strings).
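The actual wrapper class is provided in the cell below; purely for orientation, a wrapper of this kind might look roughly like the following sketch. The class name and the file name `simplewiki.txt.bz2` are assumptions, and the sketch is not a substitute for the provided `SimpleWikiDataset`.

```python
import bz2

class SentenceIterator:
    """Illustrative stand-in for the provided SimpleWikiDataset class."""

    def __init__(self, filename="simplewiki.txt.bz2"):   # file name is an assumption
        self.filename = filename

    def __iter__(self):
        # Stream the compressed file line by line; each line is one sentence
        # with tokens separated by spaces.
        with bz2.open(self.filename, mode="rt", encoding="utf-8") as source:
            for line in source:
                yield line.split()
```

Because `__iter__` is a generator, the wrapper can be iterated over repeatedly without ever holding the full dataset in memory.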
```
```
```
```
```
> Reads from an iterable of *sentences* (lists of string tokens) and returns a pair *vocab*, *counts*, where *vocab* is a dictionary representing the vocabulary and *counts* is a 1D-[ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html) with the absolute frequencies (counts) of the words in the vocabulary. The dictionary *vocab* maps words to a contiguous range of integers starting at 0. In the *counts* array, the entry at index $i$ is the count of the word that *vocab* maps to $i$. Words that occur fewer than *min_count* times are excluded from the vocabulary.
To test your code, print the sizes of the vocabularies constructed from the two datasets, as well as the count totals. The correct vocabulary size for the minimal dataset is 3,231; for the full dataset, the correct vocabulary size is 73,339. The correct totals are 155,818 for the minimal dataset and 17,297,355 for the full dataset.
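A minimal sketch of one way to implement such a function is given below; the function name `make_vocab` and the `min_count` default are assumptions, and the skeleton in the notebook is authoritative.

```python
import numpy as np
from collections import Counter

def make_vocab(sentences, min_count=5):   # name and default are assumptions
    # Count all tokens in a single streaming pass over the sentences.
    counter = Counter(token for sentence in sentences for token in sentence)
    vocab, counts = {}, []
    for word, count in counter.items():
        if count >= min_count:
            vocab[word] = len(vocab)      # contiguous ids starting at 0
            counts.append(count)
    return vocab, np.array(counts)
```

Feeding a generator expression to `Counter` keeps the pass over the data streaming, in line with the efficiency advice above.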
```
```
> Reads from an iterable of *sentences* (lists of string tokens) and yields the preprocessed sentences as non-empty lists of word ids (integers). Words not in *vocab* are discarded. The remaining words are randomly discarded according to the subsampling strategy with the given *threshold*. In the non-empty sentences, each token is replaced by its id in the vocabulary.
**⚠️ Please observe** that your function should *yield* the preprocessed sentences, not return a list with all of them. That is, we ask you to write a *generator function*. If you have not worked with generators and iterators before, now is a good time to read up on them. [More information about generators](https://wiki.python.org/moin/Generators)
Test your code by comparing the total number of tokens in the preprocessed version of each dataset with the corresponding number for the original data. The former should be ca. 59% of the latter for the minimal dataset, and ca. 69% for the full dataset. The exact percentage will vary slightly because of the randomness in the sampling. You may want to repeat your computation several times.
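The sketch below shows one possible shape for this generator. The function name and signature are assumptions, and the keep probability follows the subsampling heuristic from the original word2vec paper, $\sqrt{t/f(w)}$ capped at 1, where $f(w)$ is the relative frequency of $w$ and $t$ is the threshold; the lab's own definition of the subsampling strategy takes precedence.

```python
import random

def preprocess(sentences, vocab, counts, threshold=0.001):   # signature is an assumption
    total = counts.sum()
    # Keep probability per word id: sqrt(threshold / relative frequency), capped at 1.
    keep_prob = {i: min(1.0, (threshold * total / counts[i]) ** 0.5)
                 for i in range(len(counts))}
    for sentence in sentences:
        ids = [vocab[w] for w in sentence if w in vocab]
        ids = [i for i in ids if random.random() < keep_prob[i]]
        if ids:
            yield ids
```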
```
The general plan for solving this problem is to implement a generator function that traverses the preprocessed sentences, samples a window at each position of the text, and extracts all positive examples from that window. For each positive example, the function also generates $k$ negative examples, where $k$ is a hyperparameter. Finally, all examples (positive and negative) are combined into the tensor representation described below.
How should you represent a batch of training examples? Writing $B$ for the batch size, the obvious choice would be to represent the inputs as a matrix of shape $[B, 2]$ and the output labels (positive/negative) as a vector of length $B$. This representation would be quite wasteful on the input side, however, as each target word (index) from a positive example would have to be repeated in all negative samples. For example ($k=3$): a single positive example with target word $w$ and context word $c^{+}$, together with negative samples $c^{-}_1, c^{-}_2, c^{-}_3$, would give rise to the four input rows $(w, c^{+})$, $(w, c^{-}_1)$, $(w, c^{-}_2)$, $(w, c^{-}_3)$ and the label vector $(1, 0, 0, 0)$, with the target word $w$ repeated in every row.
Here you will use a different representation: First, instead of a single input batch, there will be a *pair* of input batches – a vector for the target words and a matrix for the context words. If the target word vector has length $B$, the context word matrix has shape $[B, 1+k]$. The $i$th element of the target word vector is the target word for *all* context words in the $i$th row of the context word matrix: the first column of that row comes from a positive example, the remaining columns come from the $k$ negative samples. Accordingly, the batch with the output labels will be a matrix of the same shape as the context word matrix, with its first column set to 1 and its remaining columns set to 0. Corresponding to the example above: the target word vector contains the single entry $w$, the corresponding row of the context word matrix is $(c^{+}, c^{-}_1, c^{-}_2, c^{-}_3)$, and the corresponding row of the label matrix is $(1, 0, 0, 0)$.
To implement negative sampling from this distribution (in which the probability of a context word is proportional to its count raised to the power $\alpha$, the negative-sampling exponent), you can follow a standard recipe: Start by pre-computing an array containing the *cumulative sums* of the exponentiated counts. Then, draw a random cumulative count $n$ uniformly between 0 and the largest (last) entry of the array, and find the index at which $n$ would have to be inserted to keep the array sorted. That index identifies the sampled context word.
All operations in this recipe can be implemented efficiently in PyTorch; the relevant functions are [`torch.cumsum`](https://pytorch.org/docs/stable/generated/torch.cumsum.html) and [`torch.searchsorted`](https://pytorch.org/docs/stable/generated/torch.searchsorted.html). For optimal efficiency, you should sample all $B \times k$ negative examples in a batch at once.
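As an illustration of this recipe, a sampler along the following lines could be built once from the counts array; the function name and the `ns_exponent` default are assumptions.

```python
import torch

def make_sampler(counts, ns_exponent=0.75):   # name and default are assumptions
    # Cumulative sums of the counts raised to the power alpha.
    cumulative = torch.cumsum(torch.tensor(counts, dtype=torch.double) ** ns_exponent, dim=0)
    total = cumulative[-1]

    def sample(shape):
        # Draw uniform random "cumulative counts" in [0, total) and find the
        # indices at which they would have to be inserted to keep the array sorted.
        random_counts = torch.rand(shape, dtype=torch.double) * total
        return torch.searchsorted(cumulative, random_counts)

    return sample
```

Calling `sample((B, k))` then returns a `[B, k]` tensor of sampled word ids in a single vectorized operation.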
```
> Reads from an iterable of *sentences* (lists of string tokens), preprocesses them using the function implemented in Problem 2, and then yields pairs of input batches for gradient-based training, represented as described above. Each batch contains *batch_size* positive examples. The parameter *window* specifies the maximal distance between a target word and a context word in a positive example; the actual window size around any given target word is sampled uniformly at random. The parameter *num_ns* specifies the number of negative samples per positive sample. The parameter *ns_exponent* specifies the exponent in the negative sampling (called $\alpha$ above).
To test your code, compare the total number of positive samples (across all batches) to the total number of tokens in the (un-preprocessed) minimal dataset. The ratio between these two values should be ca. 2.64. If you can spare the time, you can make the same comparison on the full dataset; here, the expected ratio is 3.25. As before, the numbers may vary slightly because of randomness, so you may want to run the comparison more than once.
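For orientation, the overall generator could be organised roughly as in the sketch below, reusing the hypothetical `preprocess` and `make_sampler` helpers sketched earlier; the signature and defaults are assumptions. The label matrix is not yielded here, since it is fully determined by the batch shape (first column 1, remaining columns 0) and can be constructed in the training loop.

```python
import random
import torch

def training_batches(sentences, vocab, counts,
                     batch_size=10000, window=5, num_ns=5, ns_exponent=0.75):
    # Names and defaults are assumptions; the notebook fixes the real signature.
    sample_negatives = make_sampler(counts, ns_exponent)
    targets, contexts = [], []
    for sentence in preprocess(sentences, vocab, counts):
        for i, target in enumerate(sentence):
            width = random.randint(1, window)   # actual window size at this position
            for j in range(max(0, i - width), min(len(sentence), i + width + 1)):
                if j == i:
                    continue
                targets.append(target)
                contexts.append(sentence[j])
                if len(targets) == batch_size:
                    w = torch.tensor(targets)                        # shape [B]
                    c = torch.empty(batch_size, 1 + num_ns, dtype=torch.long)
                    c[:, 0] = torch.tensor(contexts)                 # positive contexts
                    c[:, 1:] = sample_negatives((batch_size, num_ns))
                    yield w, c
                    targets, contexts = [], []
```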
```
Now it is time to implement the skip-gram model as such. The cell below contains skeleton code for this. As you will recall from Lecture 1.4, the core of the implementation is formed by two embedding layers: one for the target word representations, and one for the context word representations. Your task is to implement the missing `forward()` method.
```
> The input to this method is a tensor *w* with target words of shape $[B]$ and a tensor *c* with context words of shape $[B, 1+k]$, where $B$ is the batch size and $k$ is the number of negative samples. The two tensors are structured as explained for Problem 3. The output of the method is a tensor $D$ of shape $[B, 1+k]$ where entry $D_{ij}$ is the dot product between the embedding vector for the $i$th target word and the embedding vector for the context word in row $i$, column $j$.
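A possible implementation of the two embedding layers and the batched dot products is sketched below; the class and attribute names are assumptions, and the notebook's skeleton code is authoritative.

```python
import torch
import torch.nn as nn

class SkipGramModel(nn.Module):
    # Illustrative sketch; names are assumptions.

    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.target = nn.Embedding(vocab_size, embedding_dim)    # target word vectors
        self.context = nn.Embedding(vocab_size, embedding_dim)   # context word vectors

    def forward(self, w, c):
        t = self.target(w)    # [B, D]
        u = self.context(c)   # [B, 1+k, D]
        # Batched dot products: [B, 1+k, D] x [B, D, 1] -> [B, 1+k, 1] -> [B, 1+k]
        return torch.bmm(u, t.unsqueeze(-1)).squeeze(-1)
```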
```
```
First, instead of categorical cross entropy, you will use binary cross entropy. Just like the standard implementation of the softmax classifier, the skip-gram model does not include a final non-linearity, so you should use [`binary_cross_entropy_with_logits()`](https://pytorch.org/docs/1.9.1/generated/torch.nn.functional.binary_cross_entropy_with_logits.html).
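For concreteness, one pass over the training batches could look roughly as follows, building the label matrix with its first column set to 1 as described in Problem 3. The model and batch generator follow the earlier sketches, and the optimizer choice and embedding dimension are assumptions.

```python
import torch
import torch.nn.functional as F

model = SkipGramModel(len(vocab), embedding_dim=100)   # dimension is an assumption
optimizer = torch.optim.Adam(model.parameters())

for w, c in training_batches(dataset, vocab, counts):
    optimizer.zero_grad()
    logits = model(w, c)                      # [B, 1+k] dot products
    labels = torch.zeros_like(logits)         # first column positive, rest negative
    labels[:, 0] = 1.0
    loss = F.binary_cross_entropy_with_logits(logits, labels)
    loss.backward()
    optimizer.step()
```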
```
```
Training on the full dataset will take some time – on a CPU, you should expect 10–40 minutes per epoch, depending on hardware. To give you some guidance: The total number of positive examples is approximately 58M, and the batch size is chosen so that each batch contains roughly 10% of these examples. To speed things up, you can train using a GPU; our reference implementation runs in less than 2 minutes per epoch on [Colab](http://colab.research.google.com).
```
Now that you have a trained model, you will probably be curious to see what it has learned. You can inspect your embeddings using the [Embedding Projector](http://projector.tensorflow.org). To that end, click on the ‘Load’ button, which will open up a dialogue with instructions for how to upload embeddings from your computer.
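One way to export your embeddings for the Embedding Projector is to write two tab-separated files, one with the vectors and one with the corresponding words as metadata, and upload both through the ‘Load’ dialogue. The file names and the `target` attribute below follow the earlier sketches and are assumptions.

```python
def save_for_projector(model, vocab, vectors_file="vectors.tsv", metadata_file="metadata.tsv"):
    # File names and the 'target' attribute are assumptions.
    weights = model.target.weight.detach().cpu()
    with open(vectors_file, "w", encoding="utf-8") as vectors, \
         open(metadata_file, "w", encoding="utf-8") as metadata:
        for word, idx in vocab.items():
            vectors.write("\t".join(f"{x:.5f}" for x in weights[idx].tolist()) + "\n")
            metadata.write(word + "\n")
```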
```