Add the accuracy function

77b3d867 · Marco Kuhlmann · b4a9f372 · 77b3d867
Commit 77b3d867 authored 1 year ago by Marco Kuhlmann
--- a/challenge/challenge.ipynb
+++ b/challenge/challenge.ipynb
@@ -262,6 +262,24 @@
    "The following function that computes the accuracy of the tagger on gold-standard data."
   ]
  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def accuracy(tagger, gold_data):\n",
+    "    correct = 0\n",
+    "    total = 0\n",
+    "    for tagged_sentence in gold_data:\n",
+    "        words, gold_tags = zip(*tagged_sentence)\n",
+    "        pred_tags = tagger.predict(words)\n",
+    "        for gold_tag, pred_tag in zip(gold_tags, pred_tags):\n",
+    "            correct += int(gold_tag == pred_tag)\n",
+    "            total += 1\n",
+    "    return correct / total"
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},
@@ -362,7 +380,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.10.10"
+   "version": "3.10.4"
  },
  "latex_envs": {
   "bibliofile": "biblio.bib",

 %% Cell type:markdown id: tags:

 # Feature engineering for part-of-speech tagging

 %% Cell type:markdown id: tags:

 In this challenge, you will practice your skills in feature engineering, the task of identifying useful features for a machine learning system.

 %% Cell type:markdown id: tags:

 ## The data set

 %% Cell type:markdown id: tags:

 The data for this challenge are the same as for lab L4 on dependency parsing, but for this lab we have converted them into a simpler format: words and their part-of-speech tags are separated by tabs, and sentences are separated by empty lines. The code in the next cell defines a container class for data in this format.

 %% Cell type:code id: tags:

 ``` python
 class Dataset():

    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        tmp = []
        with open(self.filename, 'rt', encoding='utf-8') as lines:
            for line in lines:
                line = line.rstrip()
                if line:
                    tmp.append(tuple(line.split('\t')))
                else:
                    yield tmp
                    tmp = []
            # Also yield the last sentence if the file does not end with an empty line
            if tmp:
                yield tmp
 ```
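
To make the file format concrete, here is a minimal sketch that parses a small in-memory sample. The helper `read_sentences` is a hypothetical stand-in that follows the same line-by-line logic as the class above, and the two sample sentences are made up; they are not taken from `train.txt`.

```python
import io

def read_sentences(lines):
    """Hypothetical helper: group tab-separated lines into sentences."""
    sentence = []
    for line in lines:
        line = line.rstrip()
        if line:
            # A non-empty line holds one word and its tag, separated by a tab
            sentence.append(tuple(line.split('\t')))
        else:
            # An empty line ends the current sentence
            yield sentence
            sentence = []

# Two made-up sentences, each followed by an empty line
sample = "The\tDET\ndog\tNOUN\nbarks\tVERB\n\nIt\tPRON\nsleeps\tVERB\n\n"
sentences = list(read_sentences(io.StringIO(sample)))
print(sentences[0])  # [('The', 'DET'), ('dog', 'NOUN'), ('barks', 'VERB')]
```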

 %% Cell type:markdown id: tags:

 We load the training data and the development data for this lab:

 %% Cell type:code id: tags:

 ``` python
 train_data = Dataset('train.txt')
 dev_data = Dataset('dev.txt')
 ```

 %% Cell type:markdown id: tags:

 Both data sets consist of **tagged sentences**. In Python, a tagged sentence is represented as a list of string pairs, where the first component of each pair is a word token and the second component is that word’s tag. The possible tags are listed and exemplified in the [Annotation Guidelines](http://universaldependencies.org/u/pos/all.html) of the Universal Dependencies Project.
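
As a small illustration, the following sketch builds one tagged sentence by hand (the sentence is invented for this example) and splits it into separate word and tag sequences with `zip(*...)`, the same idiom that the training and evaluation code in this notebook uses.

```python
# A tagged sentence is a list of (word, tag) pairs
tagged_sentence = [('The', 'DET'), ('dog', 'NOUN'), ('barks', 'VERB')]

# zip(*pairs) transposes the list of pairs into two parallel tuples
words, gold_tags = zip(*tagged_sentence)
print(words)      # ('The', 'dog', 'barks')
print(gold_tags)  # ('DET', 'NOUN', 'VERB')
```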

 %% Cell type:markdown id: tags:

 ## Baseline tagger

 %% Cell type:markdown id: tags:

 The baseline tagger that you will use in this lab is a pure Python implementation of the perceptron tagger that was presented in the video lectures on [Part-of-speech tagging with the perceptron](https://web.microsoftstream.com/video/1a846d64-57e9-41ba-a3ca-3ab74cf32039) and [The perceptron learning algorithm](https://web.microsoftstream.com/video/1a69c6d6-35e1-42dc-81ba-8b5e52208831). To understand what the code provided here does, and how it might be extended with new features, you should watch these two lectures.

 %% Cell type:markdown id: tags:

 ### Linear model

 %% Cell type:code id: tags:

 ``` python
 from collections import defaultdict

 class Linear(object):

    def __init__(self, classes):
        self.classes = sorted(classes)
        self.weight = {c: defaultdict(float) for c in self.classes}
        self.bias = {c: 0.0 for c in self.classes}

    def forward(self, features):
        scores = {}
        for c in self.classes:
            scores[c] = self.bias[c]
            for f, v in features.items():
                scores[c] += v * self.weight[c][f]
        return scores
 ```
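
The scoring rule can be tried out on toy numbers. The sketch below re-implements the computation as a standalone function rather than using the class itself; the two classes, the feature keys, and all weight values are made up for illustration.

```python
from collections import defaultdict

# Toy parameters for two classes (all values are invented)
weight = {'NOUN': defaultdict(float), 'VERB': defaultdict(float)}
bias = {'NOUN': 0.5, 'VERB': 0.0}
weight['NOUN'][(0, 'dog')] = 2.0
weight['VERB'][(0, 'dog')] = -1.0

def forward(features):
    # score(c) = bias[c] + sum of value * weight[c][feature];
    # features without a stored weight contribute 0.0
    return {c: bias[c] + sum(v * weight[c][f] for f, v in features.items())
            for c in weight}

features = {(0, 'dog'): 1, (3, 'DET'): 1}
scores = forward(features)
print(scores)                       # {'NOUN': 2.5, 'VERB': -1.0}
print(max(scores, key=scores.get))  # NOUN
```

Because the weights are stored in a `defaultdict`, features that were never seen during training need no special handling.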

 %% Cell type:markdown id: tags:

 ### Perceptron learning algorithm

 %% Cell type:code id: tags:

 ``` python
 class PerceptronTrainer(object):

    def __init__(self, model):
        self.model = model
        self._acc = Linear(model.classes)
        self._counter = 1

    def update(self, features, gold):
        scores = self.model.forward(features)
        pred = max(self.model.classes, key=lambda c: scores[c])
        if pred != gold:
            self.model.bias[gold] += 1
            self.model.bias[pred] -= 1
            self._acc.bias[gold] += self._counter
            self._acc.bias[pred] -= self._counter
            for f, v in features.items():
                self.model.weight[gold][f] += v
                self.model.weight[pred][f] -= v
                self._acc.weight[gold][f] += v * self._counter
                self._acc.weight[pred][f] -= v * self._counter
        self._counter += 1

    def finalize(self):
        for c in self.model.classes:
            delta_b = self._acc.bias[c] / self._counter
            self.model.bias[c] -= delta_b
            for feat in self.model.weight[c]:
                delta_w = self._acc.weight[c][feat] / self._counter
                self.model.weight[c][feat] -= delta_w
 ```
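
The accumulator `_acc` and the final subtraction in `finalize` can look opaque. The following sketch checks, for a single scalar weight and a made-up sequence of updates, that the same bookkeeping yields the average of all weight snapshots (including the initial zero weight). The function names are hypothetical; this illustrates the arithmetic and is not part of the lab code.

```python
from math import isclose

def averaged_naive(deltas):
    """Average the weight over all snapshots w_0, w_1, ..., w_T."""
    w = 0.0
    snapshots = [w]  # w_0 is the initial weight
    for d in deltas:
        w += d
        snapshots.append(w)
    return sum(snapshots) / len(snapshots)

def averaged_trick(deltas):
    """Same value, using one accumulator and a step counter."""
    w, acc, counter = 0.0, 0.0, 1
    for d in deltas:
        w += d
        acc += d * counter  # weight each change by the step at which it happened
        counter += 1
    return w - acc / counter

# Made-up sequence of changes to one weight (0 means no mistake at that step)
deltas = [1, 0, -1, 0, 1, 1]
print(isclose(averaged_naive(deltas), averaged_trick(deltas)))  # True
```

The accumulator version avoids storing or summing a full copy of the weights after every training example.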

 %% Cell type:markdown id: tags:

 ### Perceptron tagger

 %% Cell type:markdown id: tags:

 This is the part of the code that you will have to modify.

 %% Cell type:code id: tags:

 ``` python
 class PerceptronTagger(object):

    def __init__(self, tags):
        self.model = Linear(tags)

    def featurize(self, words, i, pred_tags):
        # TODO: This is the only method that you are allowed to change!
        feats = []
        feats.append(words[i])
        feats.append(words[i-1] if i > 0 else '<bos>')
        feats.append(words[i+1] if i + 1 < len(words) else '<eos>')
        feats.append(pred_tags[i-1] if i > 0 else '<bos>')
        return {(i, f): 1 for i, f in enumerate(feats)}

    def predict(self, words):
        pred_tags = []
        for i, _ in enumerate(words):
            features = self.featurize(words, i, pred_tags)
            scores = self.model.forward(features)
            pred_tag = max(self.model.classes, key=lambda c: scores[c])
            pred_tags.append(pred_tag)
        return pred_tags
 ```
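
To see what the baseline feature representation looks like, the sketch below copies the logic of `featurize` into a standalone function (the name `baseline_features` is hypothetical) and applies it to a made-up sentence.

```python
def baseline_features(words, i, pred_tags):
    """Standalone copy of the baseline feature templates."""
    feats = [
        words[i],                                       # current word
        words[i-1] if i > 0 else '<bos>',               # previous word
        words[i+1] if i + 1 < len(words) else '<eos>',  # next word
        pred_tags[i-1] if i > 0 else '<bos>',           # previous tag
    ]
    # Pair each value with its template index so that, for example,
    # 'dog' as the current word and 'dog' as the next word stay distinct
    return {(k, f): 1 for k, f in enumerate(feats)}

words = ('The', 'dog', 'barks')
first = baseline_features(words, 0, [])
second = baseline_features(words, 1, ['DET'])
print(second)
# {(0, 'dog'): 1, (1, 'The'): 1, (2, 'barks'): 1, (3, 'DET'): 1}
```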

 %% Cell type:markdown id: tags:

 ### Training loop

 %% Cell type:code id: tags:

 ``` python
 from tqdm import tqdm

 def train_perceptron(train_data, n_epochs=1):
    # Collect the tags in the training data
    tags = set()
    for tagged_sentence in train_data:
        words, gold_tags = zip(*tagged_sentence)
        tags.update(gold_tags)

    # Initialise and train the perceptron tagger
    tagger = PerceptronTagger(tags)
    trainer = PerceptronTrainer(tagger.model)
    for epoch in range(n_epochs):
        with tqdm(total=sum(1 for s in train_data)) as pbar:
            for tagged_sentence in train_data:
                words, gold_tags = zip(*tagged_sentence)
                pred_tags = []
                for i, gold_tag in enumerate(gold_tags):
                    features = tagger.featurize(words, i, pred_tags)
                    trainer.update(features, gold_tag)
                    pred_tags.append(gold_tag)
                pbar.update()
    trainer.finalize()

    return tagger
 ```

 %% Cell type:markdown id: tags:

 ## Evaluation

 %% Cell type:markdown id: tags:

 The following function computes the accuracy of the tagger on gold-standard data.

+%% Cell type:code id: tags:
+
+``` python
+def accuracy(tagger, gold_data):
+    correct = 0
+    total = 0
+    for tagged_sentence in gold_data:
+        words, gold_tags = zip(*tagged_sentence)
+        pred_tags = tagger.predict(words)
+        for gold_tag, pred_tag in zip(gold_tags, pred_tags):
+            correct += int(gold_tag == pred_tag)
+            total += 1
+    return correct / total
+```
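
As a quick sanity check, the function can be run on hand-made data. In the sketch below, `AlwaysNounTagger` is a hypothetical stand-in that only provides the `predict` method the function relies on, and the two gold sentences are invented. The function is repeated so that the example runs on its own.

```python
def accuracy(tagger, gold_data):
    correct = 0
    total = 0
    for tagged_sentence in gold_data:
        words, gold_tags = zip(*tagged_sentence)
        pred_tags = tagger.predict(words)
        for gold_tag, pred_tag in zip(gold_tags, pred_tags):
            correct += int(gold_tag == pred_tag)
            total += 1
    return correct / total

class AlwaysNounTagger:
    """Hypothetical stand-in tagger that labels every word as NOUN."""
    def predict(self, words):
        return ['NOUN'] * len(words)

gold_data = [
    [('The', 'DET'), ('dog', 'NOUN'), ('barks', 'VERB')],
    [('Cats', 'NOUN'), ('sleep', 'VERB')],
]
result = accuracy(AlwaysNounTagger(), gold_data)
print(result)  # 2 of 5 tokens are correct: 0.4
```

Note that the score is computed per token over the whole data set, not averaged per sentence.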
+
 %% Cell type:markdown id: tags:

 ## Feature engineering

 %% Cell type:markdown id: tags:

 Your task now is to try to improve the performance of the perceptron tagger by adding new features. The only part of the code that you are allowed to change is the `featurize` method. Provide a short (ca. 150&nbsp;words) report on what features you added and what results you obtained.
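
As a starting point, here is a sketch of a few word-shape features that are commonly tried for part-of-speech tagging. The helper `extra_features` and its particular choices (suffix and prefix lengths, the capitalisation test) are assumptions made for illustration; whether they help on this data set has to be checked experimentally.

```python
def extra_features(words, i):
    """Hypothetical additional feature values for the word at position i."""
    word = words[i]
    return [
        word.lower(),                      # case-normalised word form
        word[-3:],                         # suffix of up to three characters
        word[:2],                          # prefix of up to two characters
        word[:1].isupper(),                # starts with a capital letter?
        any(ch.isdigit() for ch in word),  # contains a digit?
        '-' in word,                       # contains a hyphen?
    ]

feats = extra_features(('Stockholm', 'is', 'well-known'), 0)
print(feats)  # ['stockholm', 'olm', 'St', True, False, False]
```

Inside `featurize`, values like these could be appended to the existing `feats` list, so that the final dictionary comprehension assigns each of them its own template index.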

 %% Cell type:markdown id: tags:

 **⚠️ To claim the bonus points for this challenge, your submitted notebook must contain output demonstrating at least 91% accuracy on the development set.**

 %% Cell type:code id: tags:

 ``` python
 tagger = train_perceptron(train_data, n_epochs=3)
 print('{:.4f}'.format(accuracy(tagger, dev_data)))
 ```

 %% Cell type:markdown id: tags:

 *TODO: Insert your report here*

 %% Cell type:markdown id: tags:

 ## Chocolate Box Challenge

 %% Cell type:markdown id: tags:

 To participate in the [Chocolate Box Challenge](https://www.kaggle.com/t/ce6f010bfb99478fb20830735c6f4a03), run the next code cell to produce a file `submission.csv` and upload this file to Kaggle.

 %% Cell type:code id: tags:

 ``` python
 # Load the test data (without the tags)
 test_data = Dataset('test-notags.txt')

 # Generate submission.csv with results on both the dev data and the test data
 with open('submission.csv', 'w') as target:
    target.write('Id,Tag\n')
    for p, data in [('D', dev_data), ('T', test_data)]:
        for i, tagged_sentence in enumerate(data):
            words, _ = zip(*tagged_sentence)
            predicted_tags = tagger.predict(words)
            for j, tag in enumerate(predicted_tags):
                target.write('{}-{:04d}-{:04d},{}\n'.format(p, i, j, tag))
 ```
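
For reference, the sketch below formats one row of `submission.csv` with the same format string as the cell above. The sentence index, token index, and tag are made up.

```python
# Prefix 'D' marks the development data; indices are zero-padded to four digits
row = '{}-{:04d}-{:04d},{}\n'.format('D', 12, 3, 'NOUN')
print(row, end='')  # D-0012-0003,NOUN
```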

 %% Cell type:markdown id: tags:

 Please observe the following rules for the Chocolate Box Challenge:

 > The point of the challenge is to come up with interesting features. You are not allowed to change the tagger in any other way.

 Good luck, and may the best team win! 🙂