The data for this lab is [WikiText](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/), a collection of more than 100 million tokens extracted from the set of ‘Good’ and ‘Featured’ articles on Wikipedia. We will use the small version of the dataset, which contains slightly more than 2.5 million tokens.
The next cell contains code for an object that will act as a container for the ‘training’ and the ‘validation’ sections of the data. We fill this container by reading the corresponding text files. The only processing that we do is to whitespace-tokenize and to replace each newline character with a special token `<eos>` (end-of-sentence).
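One possible sketch of such a container is shown below. The class name `WikiText`, the file names `wiki.train.tokens` and `wiki.valid.tokens`, and the `vocab` dictionary are placeholders, not names prescribed by the lab.

```
class WikiText(object):
    """Container for the training and validation sections of the data (sketch)."""

    def __init__(self, train_file='wiki.train.tokens', valid_file='wiki.valid.tokens'):
        self.vocab = {}                     # maps each token to an integer id
        self.train = self.read_data(train_file)
        self.valid = self.read_data(valid_file)

    def read_data(self, path):
        ids = []
        with open(path, encoding='utf-8') as source:
            for line in source:
                # Whitespace-tokenize and mark each newline with the special token <eos>.
                for token in line.split() + ['<eos>']:
                    if token not in self.vocab:
                        self.vocab[token] = len(self.vocab)
                    ids.append(self.vocab[token])
        return ids
```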
In this section you will implement and train the fixed-window neural language model proposed by [Bengio et al. (2003)](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf) and introduced in Lecture 2.3. Recall that an input to the network takes the form of a vector of $n-1$ integers representing the preceding words. Each integer is mapped to a vector via an embedding layer. (All positions share the same embedding.) The embedding vectors are then concatenated and sent through a two-layer feed-forward network with a non-linearity in the form of a rectified linear unit (ReLU) and a final softmax layer.
> Transforms WikiText data (a list of word ids) into a pair of tensors $\mathbf{X}$, $\mathbf{y}$ that can be used to train the fixed-window model. Let $N$ be the total number of $n$-grams from the token list; then $\mathbf{X}$ is a matrix with shape $(N, n-1)$ and $\mathbf{y}$ is a vector with length $N$.
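
A sketch of such a function could look as follows; the name `vectorize_fixed_window` and its argument names are placeholders.

```
import torch

def vectorize_fixed_window(token_ids, n):
    """Build one training example per n-gram: the first n-1 ids form the context,
    the last id is the word to be predicted."""
    ngrams = [token_ids[i:i + n] for i in range(len(token_ids) - n + 1)]
    data = torch.tensor(ngrams, dtype=torch.long)    # shape (N, n)
    return data[:, :-1], data[:, -1]                 # X: (N, n-1), y: (N,)
```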
> Creates a new fixed-window neural language model. The argument *n* specifies the model’s $n$-gram order. The argument *n_words* is the number of words that need to be embedded. The optional arguments *embedding_dim* and *hidden_dim* specify the dimensionalities of the embedding layer and the hidden layer, respectively; their default value is 50.
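
A sketch of the model as an `nn.Module`, following the architecture described above. The forward pass returns logits; the final softmax is folded into the cross-entropy loss used during training.

```
import torch
import torch.nn as nn

class FixedWindowModel(nn.Module):
    """Fixed-window model: embed the n-1 context words, concatenate the embeddings,
    and feed them through a hidden layer with ReLU and a vocabulary-sized output layer."""

    def __init__(self, n, n_words, embedding_dim=50, hidden_dim=50):
        super().__init__()
        self.embedding = nn.Embedding(n_words, embedding_dim)
        self.hidden = nn.Linear((n - 1) * embedding_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, n_words)

    def forward(self, x):
        # x has shape (batch_size, n-1); the embedding gives (batch_size, n-1, embedding_dim),
        # which we flatten into (batch_size, (n-1) * embedding_dim).
        embedded = self.embedding(x).view(x.shape[0], -1)
        return self.output(torch.relu(self.hidden(embedded)))
```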
> Trains a fixed-window neural language model of order *n* using minibatch gradient descent and returns it. The parameters *n_epochs* and *batch_size* specify the number of training epochs and the minibatch size, respectively. Training uses the cross-entropy loss function and the [Adam optimizer](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam) with learning rate *lr*. After each epoch, prints the perplexity of the model on the validation data.
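
Using the sketches above, the training function might look roughly like the following. The data argument `wikitext` and the default hyperparameter values are assumptions, not the values prescribed by the lab; perplexity is computed as the exponential of the average validation cross-entropy (averaged per batch, which is a slight approximation when the last batch is smaller).

```
import math
import torch
import torch.nn.functional as F

def train_fixed_window(wikitext, n=3, n_epochs=1, batch_size=3200, lr=1e-2):
    """Sketch of minibatch training for the fixed-window model."""
    model = FixedWindowModel(n, len(wikitext.vocab))
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    train_x, train_y = vectorize_fixed_window(wikitext.train, n)
    valid_x, valid_y = vectorize_fixed_window(wikitext.valid, n)
    for epoch in range(n_epochs):
        model.train()
        for i in range(0, len(train_x), batch_size):
            optimizer.zero_grad()
            loss = F.cross_entropy(model(train_x[i:i + batch_size]),
                                   train_y[i:i + batch_size])
            loss.backward()
            optimizer.step()
        # Validation perplexity = exp of the mean cross-entropy on the validation data.
        model.eval()
        with torch.no_grad():
            losses = []
            for i in range(0, len(valid_x), batch_size):
                losses.append(F.cross_entropy(model(valid_x[i:i + batch_size]),
                                              valid_y[i:i + batch_size]).item())
        print(f'epoch {epoch + 1}, validation perplexity {math.exp(sum(losses) / len(losses)):.1f}')
    return model
```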
In this section you will implement the recurrent neural network language model that was presented in Lecture 1.5. Recall that an input to the network is a vector of word ids. Each integer is mapped to an embedding vector. The sequence of embedded vectors is then fed into an unrolled LSTM. At each position $i$ in the sequence, the hidden state of the LSTM is sent through a linear transformation into a final softmax layer, from which we read off the index of the word at position $i+1$. In theory, the input vector could represent the complete training data or at least a complete sentence; for practical reasons, however, we will truncate the input to some fixed length *bptt_len*, the **backpropagation-through-time horizon**.
> Transforms a list of token indexes into a pair of tensors $\mathbf{X}$, $\mathbf{Y}$ that can be used to train the recurrent neural language model. The rows of both tensors represent contiguous subsequences of token indexes of length *bptt_len*. Compared to the sequences in $\mathbf{X}$, the corresponding sequences in $\mathbf{Y}$ are shifted one position to the right. More precisely, if the $i$th row of $\mathbf{X}$ is the sequence that starts at token position $j$, then the same row of $\mathbf{Y}$ is the sequence that starts at position $j+1$.
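
A sketch of such a function (the name `vectorize_rnn` is a placeholder); tokens at the end of the list that do not fill a complete window are simply dropped.

```
import torch

def vectorize_rnn(token_ids, bptt_len):
    """Cut the token list into contiguous windows of length bptt_len; each row of Y
    starts one token position after the corresponding row of X."""
    data = torch.tensor(token_ids, dtype=torch.long)
    n_rows = (len(data) - 1) // bptt_len
    x = data[:n_rows * bptt_len].view(n_rows, bptt_len)
    y = data[1:n_rows * bptt_len + 1].view(n_rows, bptt_len)
    return x, y
```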
> Creates a new recurrent neural network language model. The argument *n_words* is the number of words that need to be embedded. The optional arguments *embedding_dim* and *hidden_dim* specify the dimensionalities of the embedding layer and the LSTM hidden layer, respectively; their default value is 50.
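
A sketch of the model (the class name `RNNModel` is a placeholder); as before, the forward pass returns logits and the softmax is folded into the loss.

```
import torch.nn as nn

class RNNModel(nn.Module):
    """Recurrent model: embedding layer, LSTM, and a linear projection to the vocabulary."""

    def __init__(self, n_words, embedding_dim=50, hidden_dim=50):
        super().__init__()
        self.embedding = nn.Embedding(n_words, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, n_words)

    def forward(self, x):
        # x: (batch_size, bptt_len) -> logits of shape (batch_size, bptt_len, n_words)
        hidden_states, _ = self.lstm(self.embedding(x))
        return self.output(hidden_states)
```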
The training loop for the recurrent neural network model is essentially identical to the loop that you wrote for the feed-forward model. The only thing to note is that the cross-entropy loss function expects its input to be a two-dimensional tensor; you will therefore have to re-shape the output tensor from the LSTM as well as the gold-standard output tensor in a suitable way. The most efficient way to do so is to use the [`view()`](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.view) method.
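
For example, with purely illustrative dimensions (a batch of 4 sequences of length 32 over a vocabulary of 1,000 words), the reshaping could look like this:

```
import torch
import torch.nn.functional as F

batch_size, bptt_len, n_words = 4, 32, 1000
output = torch.randn(batch_size, bptt_len, n_words)     # logits from the LSTM
gold = torch.randint(n_words, (batch_size, bptt_len))   # gold-standard word ids

# Flatten the batch and sequence dimensions so that cross_entropy sees
# a (batch_size * bptt_len, n_words) input and a (batch_size * bptt_len,) target.
loss = F.cross_entropy(output.view(-1, n_words), gold.view(-1))
```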
> Trains a recurrent neural network language model on the WikiText data using minibatch gradient descent and returns it. The parameters *n_epochs* and *batch_size* specify the number of training epochs and the minibatch size, respectively. The parameter *bptt_len* specifies the length of the backpropagation-through-time horizon, that is, the length of the input and output sequences. Training uses the cross-entropy loss function and the [Adam optimizer](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam) with learning rate *lr*. After each epoch, prints the perplexity of the model on the validation data.
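
Using the sketches above, the recurrent training loop could then be written along the following lines. Again, the data argument and the default values are assumptions, and perplexity is approximated by averaging per-batch losses.

```
import math
import torch
import torch.nn.functional as F

def train_rnn(wikitext, n_epochs=1, batch_size=100, bptt_len=32, lr=1e-2):
    """Sketch of minibatch training for the recurrent model."""
    n_words = len(wikitext.vocab)
    model = RNNModel(n_words)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    train_x, train_y = vectorize_rnn(wikitext.train, bptt_len)
    valid_x, valid_y = vectorize_rnn(wikitext.valid, bptt_len)
    for epoch in range(n_epochs):
        model.train()
        for i in range(0, len(train_x), batch_size):
            optimizer.zero_grad()
            output = model(train_x[i:i + batch_size])
            loss = F.cross_entropy(output.view(-1, n_words),
                                   train_y[i:i + batch_size].view(-1))
            loss.backward()
            optimizer.step()
        # Validation perplexity, as in the feed-forward case.
        model.eval()
        with torch.no_grad():
            losses = []
            for i in range(0, len(valid_x), batch_size):
                output = model(valid_x[i:i + batch_size])
                losses.append(F.cross_entropy(output.view(-1, n_words),
                                              valid_y[i:i + batch_size].view(-1)).item())
        print(f'epoch {epoch + 1}, validation perplexity {math.exp(sum(losses) / len(losses)):.1f}')
    return model
```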
Since the error surfaces that gradient search explores when training neural networks can be very complex, it is important to choose ‘good’ initial values for the parameters. In PyTorch, the weights of the embedding layer are initialized by sampling from the standard normal distribution $\mathcal{N}(0, 1)$. Test how changing the standard deviation and/or the distribution affects the perplexity of your feed-forward language model. Write a short report about your experience (ca. 150 words). Use the following prompts: