In this lab, you will implement the encoder–decoder architecture presented in Lecture 3.2 ([Sutskever et al., 2014](https://papers.nips.cc/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf)), including the attention-based extension presented in Lecture 3.3 ([Bahdanau et al., 2015](https://arxiv.org/abs/1409.0473)), and evaluate this architecture on a machine translation task.
```
```
We will build a system that translates from German (our **source language**) to English (our **target language**). The dataset is a collection of parallel English–German sentences taken from translations of subtitles for TED talks. It was derived from the [TED2013](https://opus.nlpl.eu/TED2013-v1.1.php) dataset, which is available in the [OPUS](http://opus.nlpl.eu/) collection. The code cell below prints the first few lines of the training data:
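As an illustration only, such a cell might look like the following sketch, assuming the source and target sentences are stored in two line-aligned files; the file names `train.de` and `train.en` are placeholders, not necessarily the names of the actual lab data files.
```
from itertools import islice

# Illustrative sketch; 'train.de' and 'train.en' are placeholder file names.
with open('train.de', encoding='utf-8') as f_src, open('train.en', encoding='utf-8') as f_tgt:
    for src_line, tgt_line in islice(zip(f_src, f_tgt), 5):
        print(src_line.rstrip())   # German source sentence
        print(tgt_line.rstrip())   # English target sentence
        print()
```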
```
```
```
> Returns a dictionary that maps the most frequent words in the *sentences* to a contiguous range of integers starting at 0. The first four mappings in this dictionary are reserved for the pseudowords `<pad>` (padding, id 0), `<bos>` (beginning of sequence, id 1), `<eos>` (end of sequence, id 2), and `<unk>` (unknown word, id 3). The parameter *max_size* caps the size of the dictionary, including the pseudowords.
```
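For orientation, a minimal sketch of such a vocabulary builder is given below. The function name `make_vocab` and the assumption that *sentences* is an iterable of token lists are ours; the notebook's own implementation may differ.
```
from collections import Counter

def make_vocab(sentences, max_size):
    # Hypothetical sketch of the vocabulary builder described above.
    # Ids 0-3 are reserved for the pseudowords.
    vocab = {'<pad>': 0, '<bos>': 1, '<eos>': 2, '<unk>': 3}
    counts = Counter(token for sentence in sentences for token in sentence)
    # Add the most frequent words until the dictionary (including pseudowords) reaches max_size.
    for word, _ in counts.most_common(max_size - len(vocab)):
        vocab[word] = len(vocab)
    return vocab
```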
The next cell defines a class for the parallel dataset. We sub-class the abstract [`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) class, which represents map-style datasets in PyTorch. This will let us use standard infrastructure related to the loading and automatic batching of data.
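A bare-bones sketch of such a subclass is shown below, assuming that each example is stored as a pair of source-side and target-side id sequences; the class and attribute names are illustrative only.
```
from torch.utils.data import Dataset

class ParallelDataset(Dataset):
    # Hypothetical sketch of a map-style dataset over (source, target) id sequences.
    def __init__(self, pairs):
        self.pairs = pairs  # list of (src_ids, tgt_ids) tuples

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        return self.pairs[idx]
```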
```
```
```
```
The encoder is relatively straightforward. We look up word embeddings and unroll a bidirectional GRU over the embedding vectors to compute a representation at each token position. We then take the last hidden state of the forward GRU and the last hidden state of the backward GRU, concatenate them, and pass them through a linear layer. This produces a summary of the source sentence, which we will later feed into the decoder.
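A possible PyTorch sketch of this encoder is shown below. It follows the shapes given in the interface description further down, but details such as sharing a single projection layer between the per-position representations and the sentence summary are assumptions on our part.
```
import torch
import torch.nn as nn

class Encoder(nn.Module):
    # Sketch only; the lab's actual encoder may differ in details.
    def __init__(self, num_words, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(num_words, embedding_dim, padding_idx=0)
        self.gru = nn.GRU(embedding_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, src):
        embedded = self.embedding(src)                     # (batch, src_len, embedding_dim)
        output, final = self.gru(embedded)                 # (batch, src_len, 2 * hidden_dim)
        output = self.proj(output)                         # (batch, src_len, hidden_dim)
        # final has shape (2, batch, hidden_dim): last states of the forward and backward GRU.
        summary = torch.cat([final[0], final[1]], dim=-1)  # (batch, 2 * hidden_dim)
        hidden = self.proj(summary)                        # (batch, hidden_dim)
        return output, hidden
```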
```
> Initialises the encoder. The encoder consists of an embedding layer that maps each of *num_words* words to an embedding vector of size *embedding_dim*, a bidirectional GRU that maps each embedding vector to a position-specific representation of size 2 × *hidden_dim*, and a final linear layer that projects these representations to new representations of size *hidden_dim*.
> Takes a tensor *src* with source-language word ids and sends it through the encoder. The input tensor has shape (*batch_size*, *src_len*), where *src_len* is the length of the sentences in the batch. (We will make sure that all sentences in the same batch have the same length.) The method returns a pair of tensors (*output*, *hidden*), where *output* has shape (*batch_size*, *src_len*, *hidden_dim*), and *hidden* has shape (*batch_size*, *hidden_dim*).
Your next task is to implement the attention mechanism. Recall that the purpose of this mechanism is to tell the decoder which parts of the source sentence to focus on when generating the next target word. For this, attention has access to the previous hidden state of the decoder as well as the complete output of the encoder. It returns the attention-weighted sum of the encoder output, the so-called *context* vector. For later use, we also return the attention weights.
As mentioned in Lecture 3.3, attention can be implemented in various ways. One very simple implementation is *uniform attention*, which assigns equal weight to each position-specific representation in the output of the encoder, and completely ignores the hidden state of the decoder. This mechanism is implemented in the cell below.
```
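For reference, uniform attention can be written roughly as follows; this is a sketch and may differ from the implementation provided in the notebook.
```
import torch
import torch.nn as nn

class UniformAttention(nn.Module):
    # Sketch: every source position receives the same weight, regardless of the decoder state.
    def forward(self, decoder_hidden, encoder_output):
        batch_size, src_len, _ = encoder_output.shape
        alpha = torch.full((batch_size, src_len), 1.0 / src_len, device=encoder_output.device)
        # The context vector is the (uniformly) weighted sum over the source positions.
        context = torch.bmm(alpha.unsqueeze(1), encoder_output).squeeze(1)
        return context, alpha
```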
The attention scores are computed using the scoring function from [Bahdanau et al. (2015)](https://arxiv.org/abs/1409.0473):

$$e_{ij} = v^{\top} \tanh(W s_{i-1} + U h_j)$$

This equation specifies how to compute the attention score (a scalar) for the previous hidden state of the decoder, denoted by $s_{i-1}$, and the $j$th position-specific representation in the output of the encoder, denoted by $h_j$. The equation refers to three parameters: a vector $v$ and two matrices $W$ and $U$. In PyTorch, these parameters can be represented in terms of (bias-free) linear layers that are trained along with the other parameters of the model.
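One way to realise this scoring function in PyTorch is sketched below; the class name and interface are modelled on the description that follows, but the details remain assumptions.
```
import torch
import torch.nn as nn
import torch.nn.functional as F

class BahdanauAttention(nn.Module):
    # Sketch: scores e_ij = v^T tanh(W s_{i-1} + U h_j), normalised with a softmax.
    def __init__(self, hidden_dim):
        super().__init__()
        self.W = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.U = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, decoder_hidden, encoder_output):
        # decoder_hidden: (batch, hidden_dim); encoder_output: (batch, src_len, hidden_dim)
        scores = self.v(torch.tanh(self.W(decoder_hidden).unsqueeze(1) + self.U(encoder_output)))
        alpha = F.softmax(scores.squeeze(-1), dim=-1)                       # (batch, src_len)
        context = torch.bmm(alpha.unsqueeze(1), encoder_output).squeeze(1)  # (batch, hidden_dim)
        return context, alpha
```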
```
> Takes the previous hidden state of the decoder (*decoder_hidden*) and the encoder output (*encoder_output*) and returns a pair (*context*, *alpha*), where *context* is the context vector computed as in [Bahdanau et al. (2015)](https://arxiv.org/abs/1409.0473) and *alpha* are the corresponding attention weights. The hidden state has shape (*batch_size*, *hidden_dim*), the encoder output has shape (*batch_size*, *src_len*, *hidden_dim*), the context has shape (*batch_size*, *hidden_dim*), and the attention weights have shape (*batch_size*, *src_len*).
Because the decoder is an autoregressive model, we need to unroll the GRU ‘manually’: At each position, we take the previous hidden state as well as the new input, and apply the GRU for one step. The initial hidden state comes from the encoder. The new input is the embedding of the previous word, concatenated with the context vector from the attention model. To produce the final output, we take the output of the GRU, concatenate the embedding vector and the context vector (residual connection), and feed the result into a linear layer. Here is a graphical representation:
We need to implement this manual unrolling for two very similar tasks: When *training*, both the inputs to and the target outputs of the GRU come from the training data. When *decoding*, the outputs of the GRU are used to generate new target-side words, and these words become the inputs to the next step of the unrolling. We have implemented methods `forward` and `decode` for these two different modes of usage. Your task is to implement a method `step` that takes a single step with the GRU.
```
> Performs a single step in the manual unrolling of the decoder GRU. This takes the output of the encoder (*encoder_output*), the previous hidden state of the decoder (*hidden*), the source mask as described in Problem 2.2 (*src_mask*), and the embedding vector of the previous word (*prev_embedded*), and computes the output as described above.
> The method returns a triple of tensors (*output*, *hidden*, *alpha*), where *output* is the decoder output for the current position (the scores over the target vocabulary, after the final linear layer), of shape (*batch_size*, *num_words*); *hidden* is the new hidden state, of shape (*batch_size*, *hidden_dim*); and *alpha* are the attention weights that were used to compute *output*, of shape (*batch_size*, *src_len*).
**Batch first!** By default, the GRU implementation in PyTorch (just like the LSTM implementation) expects its input to be a three-dimensional tensor of the form (*seq_len*, *batch_size*, *input_size*). We find it conceptually easier to change this default behaviour and let the models take their input in the form (*batch_size*, *seq_len*, *input_size*). To do so, set *batch_first=True* when instantiating the GRU.
**Unsqueeze and squeeze.** When doing the unrolling manually, we get the input in the form (*batch_size*, *input_size*). To convert between this representation and the (*batch_size*, *seq_len*, *input_size*) representation, you can use [`unsqueeze`](https://pytorch.org/docs/stable/generated/torch.unsqueeze.html) and [`squeeze`](https://pytorch.org/docs/stable/generated/torch.squeeze.html).
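Putting these hints together, a `step` method could look roughly like the sketch below. The attribute names (`self.attention`, `self.gru`, `self.out`) and layer sizes are assumptions, and the handling of *src_mask* is only indicated in a comment.
```
import torch
import torch.nn as nn

class Decoder(nn.Module):
    # Partial sketch: only the pieces needed by step() are shown.
    def __init__(self, num_words, embedding_dim, hidden_dim, attention):
        super().__init__()
        self.attention = attention
        self.gru = nn.GRU(embedding_dim + hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim + embedding_dim + hidden_dim, num_words)

    def step(self, encoder_output, hidden, src_mask, prev_embedded):
        # Attend over the encoder output, conditioned on the previous decoder hidden state.
        # (A full implementation would use src_mask to keep attention away from padded
        # source positions; this sketch leaves that detail out.)
        context, alpha = self.attention(hidden, encoder_output)

        # One manual GRU step; unsqueeze to (batch, 1, ...) because batch_first=True.
        gru_input = torch.cat([prev_embedded, context], dim=-1).unsqueeze(1)
        gru_output, new_hidden = self.gru(gru_input, hidden.unsqueeze(0))

        # Residual connection: GRU output, embedding, and context, projected to the vocabulary.
        combined = torch.cat([gru_output.squeeze(1), prev_embedded, context], dim=-1)
        output = self.out(combined)                        # (batch, num_words)
        return output, new_hidden.squeeze(0), alpha
```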
```
```
```
```
```
```
```
As mentioned before, training the translator takes quite a bit of compute power and time. Even with a GPU, you should expect training times per epoch of about 8–10 minutes. The default number of epochs is 2; however, you may want to interrupt the training prematurely and use a partially trained model in case you run out of time.
```
Figure 3 in the paper by [Bahdanau et al. (2015)](https://arxiv.org/abs/1409.0473) shows some heatmaps of attention weights in selected sentences. In the last problem of this lab, we ask you to inspect attention weights for your trained translation system. We define a function `plot_attention` that visualises the attention weights. The *x* axis corresponds to the words in the source sentence (German) and the *y* axis to the generated target sentence (English). The heatmap colours represent the strengths of the attention weights.
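A minimal version of such a plotting function might look like the sketch below; the signature and styling are assumptions, and the `plot_attention` defined in the notebook may differ.
```
import matplotlib.pyplot as plt

def plot_attention(src_words, tgt_words, attention):
    # attention: array of shape (len(tgt_words), len(src_words)) with the attention weights
    # collected while decoding the target sentence.
    fig, ax = plt.subplots()
    ax.imshow(attention, aspect='auto')
    ax.set_xticks(range(len(src_words)))
    ax.set_xticklabels(src_words, rotation=90)
    ax.set_yticks(range(len(tgt_words)))
    ax.set_yticklabels(tgt_words)
    ax.set_xlabel('source (German)')
    ax.set_ylabel('target (English)')
    fig.tight_layout()
    plt.show()
```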
```
```
Use these heatmaps to inspect the attention patterns for selected German sentences. Try to find sentences for which the model produces reasonably good English translations. If your German is a bit rusty (or non-existent), use sentences from the validation data. It might be interesting to look at examples where the German and the English word order differ substantially. Document your exploration in a short reflection piece (ca. 150 words). Respond to the following prompts: