diff --git a/labs/l3/NLP-L3.ipynb b/labs/l3/NLP-L3.ipynb
index 9e38b431a9abe6a7b57cfc852765c88a1b153f72..960a939b6f376619c6c0e1a238e5a4f48035fe6b 100644
--- a/labs/l3/NLP-L3.ipynb
+++ b/labs/l3/NLP-L3.ipynb
@@ -488,54 +488,6 @@
     "> Takes the previous hidden state of the decoder (*decoder_hidden*) and the encoder output (*encoder_output*) and returns a pair (*context*, *alpha*) where *context* is the context as computed as in [Bahdanau et al. (2015)](https://arxiv.org/abs/1409.0473), and *alpha* are the corresponding attention weights. The hidden state has shape (*batch_size*, *hidden_dim*), the encoder output has shape (*batch_size*, *src_len*, *hidden_dim*), the context has shape (*batch_size*, *hidden_dim*), and the attention weights have shape (*batch_size*, *src_len*)."
    ]
   },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "id": "qCOXKkVJE2Yl"
-   },
-   "outputs": [],
-   "source": [
-    "import math\n",
-    "\n",
-    "class ScaledDotProductAttention(nn.Module):\n",
-    "\n",
-    "    def __init__(self, hidden_dim=512):\n",
-    "        super().__init__()\n",
-    "\n",
-    "    def forward(self, decoder_hidden, encoder_output, src_mask):\n",
-    "        # decoder_hidden: [batch_size, hidden_dim]\n",
-    "        # encoder_output: [batch_size, src_len, hidden_dim]\n",
-    "        # src_mask: [batch_size, src_len]\n",
-    "\n",
-    "        # Prepare the hidden state for broadcasting\n",
-    "        decoder_hidden = decoder_hidden.unsqueeze(1)\n",
-    "        # decoder_hidden: [batch_size, 1, hidden_dim]\n",
-    "\n",
-    "        # Compute the attention scores\n",
-    "        scores = decoder_hidden @ encoder_output.transpose(-1, -2)\n",
-    "        # scores: [batch_size, 1, src_len]\n",
-    "        scores = scores.squeeze(-2)\n",
-    "        # scores: [batch_size, src_len]\n",
-    "\n",
-    "        # Normalise\n",
-    "        scores /= math.sqrt(encoder_output.size(-1))\n",
-    "\n",
-    "        # Mask out the attention scores for the padding tokens. We set\n",
-    "        # them to -inf. After the softmax, we will have 0.\n",
-    "        scores.data.masked_fill_(~src_mask, -float('inf'))\n",
-    "\n",
-    "        # Convert scores into weights\n",
-    "        alpha = F.softmax(scores, dim=1)\n",
-    "        # alpha: [batch_size, src_len]\n",
-    "\n",
-    "        # The context vector is the alpha-weighted sum of the encoder outputs.\n",
-    "        context = torch.bmm(alpha.unsqueeze(1), encoder_output).squeeze(1)\n",
-    "        # context: [batch_size, encoder_hidden_dim]\n",
-    "\n",
-    "        return context, alpha"
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {
@@ -574,7 +526,7 @@
     "\n",
     "Because the decoder is an autoregressive model, we need to unroll the GRU ‘manually’: At each position, we take the previous hidden state as well as the new input, and apply the GRU for one step. The initial hidden state comes from the encoder. The new input is the embedding of the previous word, concatenated with the context vector from the attention model. To produce the final output, we take the output of the GRU, concatenate the embedding vector and the context vector (residual connection), and feed the result into a linear layer. Here is a graphical representation:\n",
     "\n",
-    "<img src=\"https://gitlab.liu.se/nlp/nlp-course/-/raw/master/labs/l5/decoder.svg\" width=\"50%\" alt=\"Decoder architecture\"/>\n",
+    "<img src=\"https://gitlab.liu.se/nlp/nlp-course/-/raw/master/labs/l3/decoder.svg\" width=\"50%\" alt=\"Decoder architecture\"/>\n",
     "\n",
     "We need to implement this manual unrolling for two very similar tasks: When *training*, both the inputs to and the target outputs of the GRU come from the training data. When *decoding*, the outputs of the GRU are used to generate new target-side words, and these words become the inputs to the next step of the unrolling. We have implemented methods `forward` and `decode` for these two different modes of usage. Your task is to implement a method `step` that takes a single step with the GRU."
    ]
@@ -827,7 +779,7 @@
    },
    "outputs": [],
    "source": [
-    "translator = Translator(src_vocab, tgt_vocab, ScaledDotProductAttention())\n",
+    "translator = Translator(src_vocab, tgt_vocab, BahdanauAttention())\n",
     "translator.translate(['ich weiß nicht .', 'das haus ist klein .'])"
    ]
   },
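For reference, the attention interface kept as context in the first hunk (inputs *decoder_hidden* and *encoder_output*, outputs *context* and *alpha* with the stated shapes) could be satisfied by an additive, Bahdanau-style module such as the sketch below. This is illustrative only: the class name `BahdanauAttention` is taken from the changed `Translator` call in the last hunk, while the `src_mask` argument and the masking behaviour are assumed to mirror the removed `ScaledDotProductAttention` cell; the lab's actual implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BahdanauAttention(nn.Module):
    """Sketch of additive attention in the spirit of Bahdanau et al. (2015)."""

    def __init__(self, hidden_dim=512):
        super().__init__()
        # W_h projects the decoder hidden state, W_s the encoder outputs,
        # and v reduces the combined representation to a scalar score.
        self.W_h = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.W_s = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, decoder_hidden, encoder_output, src_mask):
        # decoder_hidden: [batch_size, hidden_dim]
        # encoder_output: [batch_size, src_len, hidden_dim]
        # src_mask: [batch_size, src_len] (True for real tokens, assumed)

        # Broadcast the projected decoder state over the source positions.
        query = self.W_h(decoder_hidden).unsqueeze(1)   # [batch_size, 1, hidden_dim]
        keys = self.W_s(encoder_output)                 # [batch_size, src_len, hidden_dim]

        # Additive scoring: v^T tanh(W_h h + W_s s_j) for each source position j.
        scores = self.v(torch.tanh(query + keys)).squeeze(-1)  # [batch_size, src_len]

        # Mask out padding positions before the softmax.
        scores = scores.masked_fill(~src_mask, float('-inf'))

        # Attention weights and the alpha-weighted sum of the encoder outputs.
        alpha = F.softmax(scores, dim=-1)               # [batch_size, src_len]
        context = torch.bmm(alpha.unsqueeze(1), encoder_output).squeeze(1)
        # context: [batch_size, hidden_dim]

        return context, alpha
```

With a module of this shape, the call `Translator(src_vocab, tgt_vocab, BahdanauAttention())` in the final hunk plugs the attention model into the translator unchanged.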
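The decoder cell retained as context in the second hunk describes the manual unrolling that the `step` method is meant to perform: apply the GRU for one step on the previous word embedding concatenated with the attention context, then feed the GRU output together with the embedding and the context (residual connection) into a linear output layer. A rough sketch of one such step follows; only the method name `step` comes from the notebook text, while the `Decoder` class layout, the use of `nn.GRUCell`, and the argument and return signature are assumptions.

```python
import torch
import torch.nn as nn


class Decoder(nn.Module):
    """Illustrative decoder skeleton; the lab's Decoder may be organised differently."""

    def __init__(self, tgt_vocab_size, embedding_dim=512, hidden_dim=512, attention=None):
        super().__init__()
        self.embedding = nn.Embedding(tgt_vocab_size, embedding_dim)
        self.attention = attention
        # The GRU input is the previous word embedding concatenated with the context.
        self.rnn = nn.GRUCell(embedding_dim + hidden_dim, hidden_dim)
        # The output layer sees the GRU output, the embedding, and the context
        # (residual connection), as in the decoder diagram.
        self.output = nn.Linear(hidden_dim + embedding_dim + hidden_dim, tgt_vocab_size)

    def step(self, prev_word, hidden, encoder_output, src_mask):
        # prev_word: [batch_size] (ids of the previous target-side words)
        # hidden: [batch_size, hidden_dim]
        # encoder_output: [batch_size, src_len, hidden_dim]
        # src_mask: [batch_size, src_len]

        embedded = self.embedding(prev_word)            # [batch_size, embedding_dim]
        context, alpha = self.attention(hidden, encoder_output, src_mask)

        # One step of the GRU on [embedding; context].
        hidden = self.rnn(torch.cat([embedded, context], dim=-1), hidden)

        # Residual connection: concatenate GRU output, embedding, and context.
        output = self.output(torch.cat([hidden, embedded, context], dim=-1))
        # output: [batch_size, tgt_vocab_size]

        return output, hidden, alpha
```

During training, `forward` would call `step` once per target position with inputs from the training data; during decoding, `decode` would feed each predicted word back in as the next `prev_word`.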