%% Cell type:markdown id: tags:
# Project Baseline
%% Cell type:markdown id: tags:
This notebook walks you through the implementation of the tagger–parser pipeline that serves as the baseline for the [standard project](https://www.ida.liu.se/~TDDE09/project/standard-project.html).
%% Cell type:markdown id: tags:
## Part 1: The data set
%% Cell type:markdown id: tags:
The baseline system can work with any treebank released by the [Universal Dependencies Project](http://universaldependencies.org). To read a treebank, we use the [CoNLL-U Parser](https://pypi.org/project/conllu/) library. The code in the next cell defines a PyTorch [Dataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) wrapper for the data.
%% Cell type:code id: tags:
``` python
import conllu

from torch.utils.data import Dataset


class Treebank(Dataset):
    """A PyTorch dataset of parsed sentences read from a CoNLL-U file."""

    def __init__(self, filename):
        super().__init__()
        self.items = []
        with open(filename, 'rt', encoding='utf-8') as fp:
            for tokens in conllu.parse_incr(fp):
                # Prefix each sentence with the pseudo-root token
                sentence = [('[ROOT]', '[ROOT]', 0)]
                # Keep only regular tokens (integer ids); skip multi-word tokens and empty nodes
                for token in tokens.filter(id=lambda x: type(x) is int):
                    sentence.append((token['form'], token['upos'], token['head']))
                self.items.append(sentence)

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx]
```
%% Cell type:markdown id: tags:
We load the training data and the development data from the English Web Treebank. Because the arc-standard algorithm is restricted to projective dependency trees, we need a projectivised version of the training data; this version can be produced using the script `projectivize.py`.
%% Cell type:code id: tags:
``` python
TRAIN_DATA = Treebank('en_ewt-ud-train-projectivized.conllu')
DEV_DATA = Treebank('en_ewt-ud-dev.conllu')
```
%% Cell type:markdown id: tags:
Our data consists of **parsed sentences**. A parsed sentence is represented as a list of triples. The first component of each triple (a string) represents a word. The second component (a string) specifies its part-of-speech tag; the possible tags are listed in the [Annotation Guidelines](http://universaldependencies.org/u/pos/all.html) of the Universal Dependencies Project. The third component of each triple (an integer) specifies the position of the word’s head, i.e., its parent in the dependency tree.
Run the next cell to see an example sentence:
%% Cell type:code id: tags:
``` python
TRAIN_DATA[531]
```
%% Cell type:markdown id: tags:
Note that we prefix each sentence with a special `[ROOT]` token.
%% Cell type:markdown id: tags:
## Part 2: Vocabularies
%% Cell type:markdown id: tags:
The baseline uses two vocabularies: one for the words and one for the tags. Both are represented as dictionaries that map words/tags to a contiguous range of integers, starting at zero.
The next cell contains code for a function `make_vocabs` that constructs the two vocabularies from gold-standard data. The code cell also defines a name for the “unknown word” (`[UNK]`) and for an additional pseudoword that serves as a placeholder for undefined values (`[PAD]`).
%% Cell type:code id: tags:
``` python
PAD = '[PAD]'
UNK = '[UNK]'

PAD_IDX = 0
UNK_IDX = 1


def make_vocabs(gold_data):
    # Map words/tags to contiguous ids; reserve the special entries first
    vocab_words = {PAD: PAD_IDX, UNK: UNK_IDX}
    vocab_tags = {PAD: PAD_IDX}
    for sentence in gold_data:
        for word, tag, _ in sentence:
            if word not in vocab_words:
                vocab_words[word] = len(vocab_words)
            if tag not in vocab_tags:
                vocab_tags[tag] = len(vocab_tags)
    return vocab_words, vocab_tags
```
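%% Cell type:markdown id: tags:
As a quick sanity check (not part of the baseline pipeline itself), we can build the vocabularies from the training data and inspect their sizes and the indices reserved for the special entries:
%% Cell type:code id: tags:
``` python
# Quick sanity check: build the vocabularies from the training data loaded above
# and inspect their sizes and the indices of the special entries.
vocab_words, vocab_tags = make_vocabs(TRAIN_DATA)
print(len(vocab_words), len(vocab_tags))
print(vocab_words[PAD], vocab_words[UNK], vocab_tags[PAD])
```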
%% Cell type:markdown id: tags:
## Part 3: Fixed-window model
%% Cell type:markdown id: tags:
Both the tagger and the parser of the baseline system use a fixed-window model.
%% Cell type:markdown id: tags:
### Basic structure
An input to the fixed-window model takes the form of a $k$-dimensional vector of word ids and/or tag ids. Each integer $i$ is mapped to an $e_i$-dimensional embedding vector. These vectors are concatenated to form a vector of length $e_1 + \cdots + e_k$, and sent through a feed-forward network with a single hidden layer followed by a rectified linear unit (ReLU).
%% Cell type:markdown id: tags:
### Embedding specifications
To make our implementation of the fixed-window model useful for both the tagger and the parser, it can be configured with *embedding specifications*. An embedding specification is a triple $(m, n, e)$ consisting of three integers. Such a triple specifies that the model should set up $m$ instances of an embedding from $n$ items to vectors of size $e$. All of the $m$ instances share their weights. For example, to instantiate the default feature model of the tagger (see below), we initialise the model with the following specifications:
```
[(3, num_words, word_dim), (1, num_tags, tag_dim)]
```
This specifies that the model should use 3 instances of an embedding from *num_words* words to vectors of length *word_dim*, and 1 instance of an embedding from *num_tags* tags to vectors of length *tag_dim*. All 3 instances of the word embedding should share their weights.
We initialise the weights of each embedding with values drawn from a normal distribution with mean $0$ and standard deviation $10^{-2}$.
%% Cell type:markdown id: tags:
### Specification of the fixed-window model
Here is the specification of the fixed-window model interface:
**__init__** (*self*, *embedding_specs*, *hidden_dim*, *output_dim*)
> A fixed-window model is initialised with a list of specifications for the embeddings the network should use (*embedding_specs*), the size of the hidden layer (*hidden_dim*), and the size of the output layer (*output_dim*).
**forward** (*self*, *features*)
> Computes the network output for a given feature representation *features*. This is a tensor of shape $B \times k$ where $B$ is the batch size (number of samples in the batch) and $k$ is the total number of embeddings specified upon initialisation. For example, for the default feature model, $k=4$, as this model includes 3 (weight-sharing) word embeddings and 1 tag embedding.
%% Cell type:code id: tags:
``` python
import torch
import torch.nn as nn


class FixedWindowModel(nn.Module):

    def __init__(self, embedding_specs, hidden_dim, output_dim):
        super().__init__()
        # Create the embeddings based on the given specifications
        self.embeddings = nn.ModuleList()
        for n, num_embeddings, embedding_dim in embedding_specs:
            embedding = nn.Embedding(num_embeddings, embedding_dim, padding_idx=0)
            nn.init.normal_(embedding.weight, std=1e-2)
            # Append the same module n times so that the n instances share their weights
            for i in range(n):
                self.embeddings.append(embedding)
        # Set up the FFN
        input_dim = sum(e.embedding_dim for e in self.embeddings)
        self.pipe = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, x):
        # Look up each feature column in its embedding and concatenate the results
        embedded = [e(x[..., i]) for i, e in enumerate(self.embeddings)]
        return self.pipe(torch.cat(embedded, -1))
```
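%% Cell type:markdown id: tags:
The following cell is a small usage sketch (with made-up vocabulary sizes, not taken from the treebank): it instantiates a model with the tagger's default specifications and feeds it a dummy batch, so that the input shape $B \times k$ and the shape of the output become concrete.
%% Cell type:code id: tags:
``` python
# Usage sketch with made-up sizes: 3 weight-sharing word embeddings and
# 1 tag embedding, as in the tagger's default feature model (k = 4).
toy_model = FixedWindowModel([(3, 100, 50), (1, 20, 10)], hidden_dim=100, output_dim=20)
toy_batch = torch.randint(0, 20, (8, 4))   # batch of 8 samples, 4 feature columns each
print(toy_model(toy_batch).shape)          # torch.Size([8, 20])
```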
%% Cell type:markdown id: tags:
## Part 4: Part-of-speech tagger
%% Cell type:markdown id: tags:
The tagger is a simple auto-regressive tagger that processes an input sentence from left to right, and at each position, predicts the tag for the current word based on the features extracted from the current feature window.
%% Cell type:markdown id: tags:
### Tagger interface
The tagger implements a very simple interface:
**predict** (*self*, *sentence*)
> Returns the list of predicted tags (a list of strings) for a single *sentence* (a list of string tokens).
%% Cell type:code id: tags:
``` python
class Tagger(object):

    def predict(self, sentence):
        raise NotImplementedError
```
%% Cell type:markdown id: tags:
### Default feature model
The default feature model of the tagger has the following features ($k=4$):
0. current word
1. previous word
2. next word
3. tag predicted for the previous word
Whenever the value of a feature is undefined, we use the special value `[PAD]`.
%% Cell type:markdown id: tags:
### Specification of the implementation
Here is the specification of the tagger implementation:
**__init__** (*self*, *vocab_words*, *vocab_tags*, *word_dim* = 50, *tag_dim* = 10, *hidden_dim* = 100)
> Creates a new fixed-window model of appropriate dimensions and registers the vocabularies. The parameters *vocab_words* and *vocab_tags* are the word vocabulary and tag vocabulary. The parameters *word_dim* and *tag_dim* specify the embedding width for the word embeddings and tag embeddings.
**featurize** (*self*, *words*, *i*, *pred_tags*)
> Extracts features from the specified tagger configuration according to the default feature model. The configuration is specified in terms of the words in the input sentence (*words*, a list of word ids), the position of the current word (*i*), and the list of already predicted tags (*pred_tags*, a list of tag ids). Returns a tensor that can be fed to the fixed-window model.
**predict** (*self*, *words*)
> Processes the input sentence *words* (a list of string tokens) and makes calls to the fixed-window model to predict the tag of each word. Returns the list of the predicted tags (strings).
%% Cell type:code id: tags:
``` python
class FixedWindowTagger(Tagger):

    def __init__(self, vocab_words, vocab_tags, word_dim=50, tag_dim=10, hidden_dim=100):
        embedding_specs = [(3, len(vocab_words), word_dim), (1, len(vocab_tags), tag_dim)]
        self.model = FixedWindowModel(embedding_specs, hidden_dim, len(vocab_tags))
        self.w2i = vocab_words
        self.i2t = {i: t for t, i in vocab_tags.items()}

    def featurize(self, words, i, pred_tags):
        x = torch.zeros(4, dtype=torch.long)
        x[0] = words[i]
        x[1] = words[i - 1] if i > 0 else PAD_IDX
        x[2] = words[i + 1] if i + 1 < len(words) else PAD_IDX
        x[3] = pred_tags[i - 1] if i > 0 else PAD_IDX
        return x

    def predict(self, words):
        words = [self.w2i.get(w, UNK_IDX) for w in words]
        pred_tags = []
        for i in range(len(words)):
            features = self.featurize(words, i, pred_tags)
            with torch.no_grad():
                scores = self.model.forward(features)
            pred_tag = scores.argmax().item()
            pred_tags.append(pred_tag)
        return [self.i2t[i] for i in pred_tags]
```
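%% Cell type:markdown id: tags:
To make the feature window concrete, here is a small sketch with toy vocabularies (hypothetical words and ids, not from the treebank). At the first position of the sentence there is no previous word and no previously predicted tag, so both of these features fall back to `[PAD]`:
%% Cell type:code id: tags:
``` python
# Sketch with toy vocabularies (hypothetical ids, not from the real treebank).
toy_words = {PAD: 0, UNK: 1, 'the': 2, 'cat': 3, 'sleeps': 4}
toy_tags = {PAD: 0, 'DET': 1, 'NOUN': 2, 'VERB': 3}
toy_tagger = FixedWindowTagger(toy_words, toy_tags)
# Features at position 0 of the encoded sentence "the cat sleeps":
# current word, previous word (PAD), next word, previous tag (PAD)
print(toy_tagger.featurize([2, 3, 4], 0, []))  # tensor([2, 0, 3, 0])
```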
%% Cell type:markdown id: tags:
### Generating the training examples
To generate the training examples for the tagger, we use the following generator function:
**training_examples** (*vocab_words*, *vocab_tags*, *gold_data*, *tagger*, *batch_size* = 100)
> Iterates through the given *gold_data* (an iterable of parsed sentences), encodes it into word ids and tag ids using the specified vocabularies *vocab_words* and *vocab_tags*, and then yields batches of training examples for gradient-based training. Each batch contains *batch_size* examples, except for the last batch, which may contain fewer examples. Each example in the batch is created by a call to the `featurize` function of the *tagger*.
%% Cell type:code id: tags:
``` python
def training_examples(vocab_words, vocab_tags, gold_data, tagger, batch_size=100, shuffle=False):
    bx = []
    by = []
    for sentence in gold_data:
        # Separate the words and the gold-standard tags
        words, gold_tags, _ = zip(*sentence)
        # Encode words and tags using the vocabularies
        words = [vocab_words.get(w, UNK_IDX) for w in words]
        gold_tags = [vocab_tags[t] for t in gold_tags]
        # Simulate a run of the tagger over the sentence, collecting training examples
        pred_tags = []
        for i, gold_tag in enumerate(gold_tags):
            bx.append(tagger.featurize(words, i, pred_tags))
            by.append(gold_tag)
            if len(bx) >= batch_size:
                bx = torch.stack(bx)
                by = torch.LongTensor(by)
                if shuffle:
                    random_indices = torch.randperm(len(bx))
                    yield bx[random_indices], by[random_indices]
                else:
                    yield bx, by
                bx = []
                by = []
            pred_tags.append(gold_tag)  # teacher forcing!
    # Check whether there is an incomplete batch
    if bx:
        bx = torch.stack(bx)
        by = torch.LongTensor(by)
        if shuffle:
            random_indices = torch.randperm(len(bx))
            yield bx[random_indices], by[random_indices]
        else:
            yield bx, by
```
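%% Cell type:markdown id: tags:
A quick way to check the batch format is to run the generator on a single toy sentence (hypothetical data): with three tokens, including the pseudo-root, we get three training examples, each a $k=4$ feature vector.
%% Cell type:code id: tags:
``` python
# Shape check on a single hypothetical sentence (three tokens incl. the pseudo-root).
toy_gold = [[('[ROOT]', '[ROOT]', 0), ('the', 'DET', 2), ('cat', 'NOUN', 0)]]
toy_vw, toy_vt = make_vocabs(toy_gold)
toy_tagger = FixedWindowTagger(toy_vw, toy_vt)
for bx, by in training_examples(toy_vw, toy_vt, toy_gold, toy_tagger):
    print(bx.shape, by.shape)  # torch.Size([3, 4]) torch.Size([3])
```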
%% Cell type:markdown id: tags:
### Training loop
Training the tagger uses a straightforward training loop.
**train_tagger** (*train_data*, *n_epochs* = 1, *batch_size* = 100, *lr* = 1e-2)
> Trains a fixed-window tagger from a set of training data *train_data* (an iterable over parsed sentences) using minibatch gradient descent and returns it. The parameters *n_epochs* and *batch_size* specify the number of training epochs and the minibatch size, respectively. Training uses the cross-entropy loss function and the [Adam optimizer](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam) with learning rate *lr*.
%% Cell type:code id: tags:
``` python
import torch.nn.functional as F
import torch.optim as optim

from tqdm import tqdm


def train_tagger(train_data, n_epochs=1, batch_size=100, lr=1e-2):
    # Create the vocabularies
    vocab_words, vocab_tags = make_vocabs(train_data)
    # Instantiate the tagger
    tagger = FixedWindowTagger(vocab_words, vocab_tags)
    # Instantiate the optimizer
    optimizer = optim.Adam(tagger.model.parameters(), lr=lr)
    # Training loop
    for epoch in range(n_epochs):
        running_loss = 0
        n_examples = 0
        with tqdm(total=sum(len(s) for s in train_data)) as pbar:
            for bx, by in training_examples(vocab_words, vocab_tags, train_data, tagger, batch_size):
                optimizer.zero_grad()
                output = tagger.model.forward(bx)
                loss = F.cross_entropy(output, by)
                loss.backward()
                optimizer.step()
                running_loss += loss.item()
                n_examples += 1
                pbar.set_postfix(loss=running_loss/n_examples)
                pbar.update(len(bx))
    return tagger
```
%% Cell type:markdown id: tags:
### Evaluation function
To evaluate a tagger, we compute its per-token accuracy.
**accuracy** (*tagger*, *gold_data*)
> Computes the accuracy of the *tagger* on the gold-standard data *gold_data* (an iterable of parsed sentences) and returns it as a float. Recall that the accuracy is defined as the percentage of tokens to which the tagger assigns the correct tag (as per the gold standard). The calculation ignores the pseudo-root.
%% Cell type:code id: tags:
``` python
def accuracy(tagger, gold_data):
    correct = 0
    total = 0
    for sentence in gold_data:
        words, gold_tags, _ = zip(*sentence)
        pred_tags = tagger.predict(words)
        for gold_tag, pred_tag in zip(gold_tags[1:], pred_tags[1:]):  # ignore the pseudo-root
            correct += int(gold_tag == pred_tag)
            total += 1
    return correct / total
```
%% Cell type:markdown id: tags:
### Putting everything together
The next code cell trains a tagger and evaluates it on the development data:
%% Cell type:code id: tags:
``` python
TAGGER = train_tagger(TRAIN_DATA)
print('{:.4f}'.format(accuracy(TAGGER, DEV_DATA)))
```
%% Cell type:markdown id: tags:
The tagging accuracy on the development data should be around 88%.
%% Cell type:markdown id: tags:
## Part 5: Parser
%% Cell type:markdown id: tags:
The parser part of the baseline system is a dependency parser based on the arc-standard algorithm. It consists of two parts: one that implements the algorithm logic and one that encapsulates the learning component – the fixed-window model. The parser uses the fixed-window model to predict the next move for a given configuration in the arc-standard algorithm, based on the features extracted from the current feature window.
%% Cell type:markdown id: tags:
### Parser interface
Like the tagger, the parser has a very simple interface:
%% Cell type:code id: tags:
``` python
class Parser(object):

    def predict(self, words, tags):
        raise NotImplementedError
```
%% Cell type:markdown id: tags:
The single method of this interface has the following specification:
**predict** (*self*, *words*, *tags*)
> Returns the list of predicted heads (a list of integers) for a single sentence, specified in terms of its *words* (a list of strings) and their corresponding *tags* (also a list of strings).
%% Cell type:markdown id: tags:
### Default feature model
For the parser, we will use the following features ($k=6$):
0. word form of the next word in the buffer
1. word form of the topmost word on the stack
2. word form of the second-topmost word on the stack
3. part-of-speech tag of the next word in the buffer
4. part-of-speech tag of the topmost word on the stack
5. part-of-speech tag of the second-topmost word on the stack
Whenever the value of a feature is undefined, we use the special value `[PAD]`.
%% Cell type:markdown id: tags:
### Arc-standard algorithm
Recall that, in the arc-standard algorithm, the next move (also called “transition”) of the parser is predicted based on features extracted from the current parser configuration, with references to the words and part-of-speech tags of the input sentence. On the Python side of things, the words and part-of-speech tags are represented as lists of strings, and a configuration is represented as a triple
$$
(i, \mathit{stack}, \mathit{heads})
$$
where $i$ is an integer specifying the position of the next word in the buffer, $\mathit{stack}$ is a list of integers specifying the positions of the words currently on the stack (with the topmost element last in the list), and $\mathit{heads}$ is a list of integers specifying the positions of the head words. If a word has not yet been assigned a head, its head value is 0. To illustrate this representation, the initial configuration for the example sentence is
`(0, [], [0, 0, 0, 0, 0, 0])`
and a possible final configuration is
`(6, [0], [0, 2, 0, 4, 2, 2])`
In the lecture, both the buffer and the stack were presented as lists of words. Here we represent only the *stack* as an explicit list (of word positions). To represent the *buffer*, we simply record the position of the next word that has not yet been processed (the integer $i$). This exploits the fact that the buffer (in contrast to the stack) can never grow, but is processed from left to right.
Here is the specification of the implementation of the algorithmic part of the parser:
**initial_config** (*num_words*)
> Returns the initial configuration for a sentence with the specified number of words (*num_words*).
**valid_moves** (*config*)
> Returns the list of valid moves for the specified configuration (*config*).
**next_config** (*config*, *move*)
> Applies the *move* in the specified configuration *config* and returns the new configuration. This must not modify the input configuration.
**is_final_config** (*config*)
> Tests whether *config* is a final configuration.
%% Cell type:code id: tags:
``` python
class ArcStandardParser(Parser):

    MOVES = tuple(range(3))
    SH, LA, RA = MOVES

    @staticmethod
    def initial_config(num_words):
        return 0, [], [0] * num_words

    @staticmethod
    def valid_moves(config):
        pos, stack, heads = config
        moves = []
        if pos < len(heads):
            moves.append(ArcStandardParser.SH)
        if len(stack) >= 3:  # disallow LA with root as dependent
            moves.append(ArcStandardParser.LA)
        if len(stack) >= 2:
            moves.append(ArcStandardParser.RA)
        return moves

    @staticmethod
    def next_config(config, move):
        pos, stack, heads = config
        stack = list(stack)  # copy because we will modify it
        if move == ArcStandardParser.SH:
            stack.append(pos)
            pos += 1
        else:
            heads = list(heads)  # copy because we will modify it
            s1 = stack.pop()
            s2 = stack.pop()
            if move == ArcStandardParser.LA:
                heads[s2] = s1
                stack.append(s1)
            if move == ArcStandardParser.RA:
                heads[s1] = s2
                stack.append(s2)
        return pos, stack, heads

    @staticmethod
    def is_final_config(config):
        pos, stack, heads = config
        return pos == len(heads) and len(stack) == 1
```
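%% Cell type:markdown id: tags:
As a sanity check of the implementation, the next cell replays one move sequence for the 6-token example configuration shown above, going from the initial configuration to the final configuration with heads `[0, 2, 0, 4, 2, 2]`:
%% Cell type:code id: tags:
``` python
# Replay a move sequence for the 6-token example from the text above.
p = ArcStandardParser
config = p.initial_config(6)
print(config)   # (0, [], [0, 0, 0, 0, 0, 0])
for move in [p.SH, p.SH, p.SH, p.LA, p.SH, p.SH, p.LA, p.RA, p.SH, p.RA, p.RA]:
    assert move in p.valid_moves(config)
    config = p.next_config(config, move)
print(config)   # (6, [0], [0, 2, 0, 4, 2, 2])
print(p.is_final_config(config))   # True
```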
%% Cell type:markdown id: tags:
### Specification of the implementation
Here is the specification of the parser implementation:
**__init__** (*self*, *vocab_words*, *vocab_tags*, *word_dim* = 50, *tag_dim* = 10, *hidden_dim* = 100)
> Creates a new fixed-window model of appropriate dimensions and sets up any other data structures that you consider relevant. The parameters *vocab_words* and *vocab_tags* are the word vocabulary and tag vocabulary. The parameters *word_dim* and *tag_dim* specify the embedding width for the word embeddings and tag embeddings.
**featurize** (*self*, *words*, *tags*, *config*)
> Extracts features from the specified parser state according to the feature model given above. The state is specified in terms of the words in the input sentence (*words*, a list of word ids), their part-of-speech tags (*tags*, a list of tag ids), and the parser configuration proper (*config*, represented as described in the section on the arc-standard algorithm above).
**predict** (*self*, *words*, *tags*)
> Predicts the list of all heads for the input sentence. This simulates the arc-standard algorithm, calling the move classifier whenever it needs to take a decision. The input sentence is specified in terms of the list of its words (strings) and the list of its tags (strings). Both of these should include the pseudo-root.
%% Cell type:code id: tags:
``` python
class FixedWindowParser(ArcStandardParser):

    def __init__(self, vocab_words, vocab_tags, word_dim=50, tag_dim=10, hidden_dim=180):
        embedding_specs = [(3, len(vocab_words), word_dim), (3, len(vocab_tags), tag_dim)]
        self.model = FixedWindowModel(embedding_specs, hidden_dim, len(ArcStandardParser.MOVES))
        self.w2i = vocab_words
        self.t2i = vocab_tags

    def featurize(self, words, tags, config):
        i, stack, heads = config
        x = torch.zeros(6, dtype=torch.long)
        x[0] = words[i] if i < len(words) else PAD_IDX
        x[1] = words[stack[-1]] if len(stack) >= 1 else PAD_IDX
        x[2] = words[stack[-2]] if len(stack) >= 2 else PAD_IDX
        x[3] = tags[i] if i < len(tags) else PAD_IDX
        x[4] = tags[stack[-1]] if len(stack) >= 1 else PAD_IDX
        x[5] = tags[stack[-2]] if len(stack) >= 2 else PAD_IDX
        return x

    def predict(self, words, tags):
        words = [self.w2i.get(w, UNK_IDX) for w in words]
        tags = [self.t2i.get(t, UNK_IDX) for t in tags]
        config = self.initial_config(len(words))
        valid_moves = self.valid_moves(config)
        while valid_moves:
            features = self.featurize(words, tags, config)
            with torch.no_grad():
                scores = self.model.forward(features)
            # We may only predict valid transitions
            best_score, pred_move = float('-inf'), None
            for move in valid_moves:
                if scores[move] > best_score:
                    best_score, pred_move = scores[move], move
            config = self.next_config(config, pred_move)
            valid_moves = self.valid_moves(config)
        i, stack, pred_heads = config
        return pred_heads
```
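%% Cell type:markdown id: tags:
Here is a small sketch of the parser's feature window, again with toy vocabularies (hypothetical words and ids, not from the treebank). At the initial configuration the stack is empty, so all stack-based features fall back to `[PAD]`:
%% Cell type:code id: tags:
``` python
# Sketch with toy vocabularies (hypothetical ids, not from the real treebank).
toy_words = {PAD: 0, UNK: 1, 'the': 2, 'cat': 3, 'sleeps': 4}
toy_tags = {PAD: 0, 'DET': 1, 'NOUN': 2, 'VERB': 3}
toy_parser = FixedWindowParser(toy_words, toy_tags)
config = toy_parser.initial_config(3)
# Features: buffer word, stack words (PAD, PAD), buffer tag, stack tags (PAD, PAD)
print(toy_parser.featurize([2, 3, 4], [1, 2, 3], config))  # tensor([2, 0, 0, 1, 0, 0])
```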
%% Cell type:markdown id: tags:
### Oracle
The learning component of the parser is the next move classifier. To train this classifier, we need training examples of the form $(\mathbf{x}, m)$, where $\mathbf{x}$ is a feature vector extracted from a given parser configuration $c$, and $m$ is the corresponding gold-standard move. To obtain $m$, we need an **oracle**.
Recall that, in the context of transition-based dependency parsing, an oracle is a function that translates a gold-standard dependency tree (here represented as a list of head ids) into a sequence of moves such that, when the parser executes these moves starting from the initial configuration, it recreates the original dependency tree.
Here is the formal specification of the oracle:
**oracle_moves** (*gold_heads*)
> Translates a gold-standard head assignment for a single sentence (*gold_heads*) into the corresponding stream of oracle moves. More specifically, this yields pairs $(c, m)$ where $m$ is a move (an integer, as specified in the `ArcStandardParser` interface) and $c$ is the parser configuration in which $m$ was taken.
%% Cell type:code id: tags:
``` python
def oracle_moves(gold_heads):
    # Keep track of how many dependents each head still needs to find
    remaining_count = [0] * len(gold_heads)
    for node in gold_heads:
        remaining_count[node] += 1
    # Simulate a parser
    config = ArcStandardParser.initial_config(len(gold_heads))
    while not ArcStandardParser.is_final_config(config):
        pos, stack, heads = config
        if len(stack) >= 2:
            s1 = stack[-1]
            s2 = stack[-2]
            if gold_heads[s2] == s1 and remaining_count[s2] == 0:
                move = ArcStandardParser.LA
                yield config, move
                config = ArcStandardParser.next_config(config, move)
                remaining_count[s1] -= 1
                continue
            if gold_heads[s1] == s2 and remaining_count[s1] == 0:
                move = ArcStandardParser.RA
                yield config, move
                config = ArcStandardParser.next_config(config, move)
                remaining_count[s2] -= 1
                continue
        move = ArcStandardParser.SH
        yield config, move
        config = ArcStandardParser.next_config(config, move)
```
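%% Cell type:markdown id: tags:
We can check the oracle against the example from the arc-standard section above: for the gold heads `[0, 2, 0, 4, 2, 2]`, replaying the oracle's moves must rebuild exactly that head list.
%% Cell type:code id: tags:
``` python
# Sanity check on the example heads from above: the oracle's move sequence
# must reconstruct the gold head assignment when replayed.
gold = [0, 2, 0, 4, 2, 2]
moves = [m for _, m in oracle_moves(gold)]
print(moves)   # [0, 0, 0, 1, 0, 0, 1, 2, 0, 2, 2] = SH SH SH LA SH SH LA RA SH RA RA
config = ArcStandardParser.initial_config(len(gold))
for m in moves:
    config = ArcStandardParser.next_config(config, m)
print(config[2] == gold)   # True
```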
%% Cell type:markdown id: tags:
### Generating the training examples
This time, we generate the training examples for the parser:
**training_examples** (*vocab_words*, *vocab_tags*, *gold_data*, *parser*, *batch_size* = 100)
> Iterates through the given *gold_data* (an iterable of parsed sentences), encodes it into word ids and tag ids using the specified vocabularies *vocab_words* and *vocab_tags*, and then yields batches of training examples for gradient-based training. Each batch contains *batch_size* examples, except for the last batch, which may contain fewer examples. Each example in the batch is created by a call to the `featurize` function of the *parser*.
%% Cell type:code id: tags:
``` python
def training_examples(vocab_words, vocab_tags, gold_data, parser, batch_size=100):
    bx = []
    by = []
    for sentence in gold_data:
        # Separate the words, gold tags, and gold heads
        words, tags, gold_heads = zip(*sentence)
        # Encode words and tags using the vocabularies
        words = [vocab_words.get(w, UNK_IDX) for w in words]
        tags = [vocab_tags[t] for t in tags]
        # Call the oracle
        for config, gold_move in oracle_moves(gold_heads):
            bx.append(parser.featurize(words, tags, config))
            by.append(gold_move)
            if len(bx) >= batch_size:
                bx = torch.stack(bx)
                by = torch.LongTensor(by)
                yield bx, by
                bx = []
                by = []
    # Check whether there is an incomplete batch
    if bx:
        bx = torch.stack(bx)
        by = torch.LongTensor(by)
        yield bx, by
```
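%% Cell type:markdown id: tags:
Analogously to the tagger, we can check the batch format on a single toy sentence (hypothetical data): a sentence with $n$ tokens, including the pseudo-root, gives $2n - 1$ oracle moves, so three tokens yield five training examples with $k=6$ features each.
%% Cell type:code id: tags:
``` python
# Shape check on a single hypothetical sentence (three tokens incl. the pseudo-root).
toy_gold = [[('[ROOT]', '[ROOT]', 0), ('the', 'DET', 2), ('cat', 'NOUN', 0)]]
toy_vw, toy_vt = make_vocabs(toy_gold)
toy_parser = FixedWindowParser(toy_vw, toy_vt)
for bx, by in training_examples(toy_vw, toy_vt, toy_gold, toy_parser):
    print(bx.shape, by.shape)  # torch.Size([5, 6]) torch.Size([5])
```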
%% Cell type:markdown id: tags:
### Training loop
The training loop is straightforward:
**train_parser** (*train_data*, *n_epochs* = 1, *batch_size* = 100, *lr* = 1e-2)
> Trains a fixed-window parser from a set of training data *train_data* (an iterable over parsed sentences) using minibatch gradient descent and returns it. The parameters *n_epochs* and *batch_size* specify the number of training epochs and the minibatch size, respectively. Training uses the cross-entropy loss function and the [Adam optimizer](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam) with learning rate *lr*.
%% Cell type:code id: tags:
``` python
import torch.nn.functional as F
import torch.optim as optim

from tqdm import tqdm


def train_parser(train_data, n_epochs=1, batch_size=100, lr=1e-2):
    # Create the vocabularies
    vocab_words, vocab_tags = make_vocabs(train_data)
    # Instantiate the parser
    parser = FixedWindowParser(vocab_words, vocab_tags)
    # Instantiate the optimizer
    optimizer = optim.Adam(parser.model.parameters(), lr=lr)
    # Training loop
    for epoch in range(n_epochs):
        running_loss = 0
        n_examples = 0
        # A sentence with n tokens (including the pseudo-root) yields 2n - 1 moves
        with tqdm(total=sum(2*len(s)-1 for s in train_data)) as pbar:
            for bx, by in training_examples(vocab_words, vocab_tags, train_data, parser, batch_size):
                optimizer.zero_grad()
                output = parser.model.forward(bx)
                loss = F.cross_entropy(output, by)
                loss.backward()
                optimizer.step()
                running_loss += loss.item()
                n_examples += 1
                pbar.set_postfix(loss=running_loss/n_examples)
                pbar.update(len(bx))
    return parser
```
%% Cell type:markdown id: tags:
### Evaluation function
To evaluate a parser, we compute its unlabelled attachment score on gold-standard data.
**uas** (*parser*, *gold_data*)
> Computes the unlabelled attachment score of the specified *parser* on the gold-standard data *gold_data* (an iterable of parsed sentences) and returns it as a float. The unlabelled attachment score is the percentage of all tokens to which the parser assigns the correct head (as per the gold standard). The calculation ignores the pseudo-roots.
%% Cell type:code id: tags:
``` python
def uas(parser, gold_sentences):
    correct = 0
    total = 0
    for sentence in gold_sentences:
        words, tags, gold_heads = zip(*sentence)
        pred_heads = parser.predict(words, tags)
        for gold, pred in zip(gold_heads[1:], pred_heads[1:]):  # ignore the pseudo-root
            correct += int(gold == pred)
            total += 1
    return correct / total
```
%% Cell type:markdown id: tags:
### Putting everything together
The next code cell trains a parser and evaluates it on the development data:
%% Cell type:code id: tags:
``` python
PARSER = train_parser(TRAIN_DATA, n_epochs=1)
print('{:.4f}'.format(uas(PARSER, DEV_DATA)))
```
%% Cell type:markdown id: tags:
The unlabelled attachment score on the development data (with gold-standard tags) should be around 70%.
%% Cell type:markdown id: tags:
## Part 6: Final evaluation
%% Cell type:markdown id: tags:
For the final evaluation, we chain the tagger and the parser into a pipeline: The tags predicted by the tagger become the input tags to the parser.
%% Cell type:code id: tags:
``` python
def evaluate(tagger, parser, gold_sentences):
    correct_tagger = 0
    total_tagger = 0
    correct_parser = 0
    total_parser = 0
    for sentence in gold_sentences:
        words, gold_tags, gold_heads = zip(*sentence)
        # Tagging accuracy, ignoring the pseudo-root
        pred_tags = tagger.predict(words)
        for gold, pred in zip(gold_tags[1:], pred_tags[1:]):
            correct_tagger += int(gold == pred)
            total_tagger += 1
        # Attachment score with predicted (not gold) tags, ignoring the pseudo-root
        pred_heads = parser.predict(words, pred_tags)
        for gold, pred in zip(gold_heads[1:], pred_heads[1:]):
            correct_parser += int(gold == pred)
            total_parser += 1
    return correct_tagger / total_tagger, correct_parser / total_parser
```
%% Cell type:code id: tags:
``` python
acc, uas = evaluate(TAGGER, PARSER, DEV_DATA)
print('acc: {:.4f}, uas: {:.4f}'.format(acc, uas))
```
%% Cell type:markdown id: tags:
The tagging accuracy and unlabelled attachment score on the development data should be around 88% and 65%, respectively.