Add lab L3X

3e2a8d9c · Marco Kuhlmann · e5580ada · 3e2a8d9c · 3e2a8d9c · 3e2a8d9c
Commit 3e2a8d9c authored 4 years ago by Marco Kuhlmann
--- a/labs/l3x/NLP-L3X.ipynb
+++ b/labs/l3x/NLP-L3X.ipynb
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# L3X: Feature engineering for part-of-speech tagging"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In this lab, you will practice your skills in feature engineering, the task of identifying useful features for a machine learning system."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## The data set"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The data for this lab and their representation is the same as for the basic lab."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class Dataset():\n",
+    "\n",
+    "    def __init__(self, filename):\n",
+    "        self.filename = filename\n",
+    "\n",
+    "    def __iter__(self):\n",
+    "        tmp = []\n",
+    "        with open(self.filename, 'rt', encoding='utf-8') as lines:\n",
+    "            for line in lines:\n",
+    "                line = line.rstrip()\n",
+    "                if line:\n",
+    "                    tmp.append(tuple(line.split('\\t')))\n",
+    "                else:\n",
+    "                    yield tmp\n",
+    "                    tmp = []"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We load the training data and the development data for this lab:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "train_data = Dataset('train.txt')\n",
+    "dev_data = Dataset('dev.txt')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Baseline tagger"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The baseline tagger that you will use in this lab is a pure Python implementation of the perceptron tagger that was presented in Lecture&nbsp;3.3 and Lecture&nbsp;3.4. To understand what the code provided here does, and how it might be extended with new features, you should watch these two lectures."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Linear model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from collections import defaultdict\n",
+    "\n",
+    "class Linear(object):\n",
+    "\n",
+    "    def __init__(self, classes):\n",
+    "        self.classes = sorted(classes)\n",
+    "        self.weight = {c: defaultdict(float) for c in self.classes}\n",
+    "        self.bias = {c: 0.0 for c in self.classes}\n",
+    "\n",
+    "    def forward(self, features):\n",
+    "        scores = {}\n",
+    "        for c in self.classes:\n",
+    "            scores[c] = self.bias[c]\n",
+    "            for f, v in features.items():\n",
+    "                scores[c] += v * self.weight[c][f]\n",
+    "        return scores"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Perceptron learning algorithm"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class PerceptronTrainer(object):\n",
+    "\n",
+    "    def __init__(self, model):\n",
+    "        self.model = model\n",
+    "        self._acc = Linear(model.classes)\n",
+    "        self._counter = 1\n",
+    "\n",
+    "    def update(self, features, gold):\n",
+    "        scores = self.model.forward(features)\n",
+    "        pred = max(self.model.classes, key=lambda c: scores[c])\n",
+    "        if pred != gold:\n",
+    "            self.model.bias[gold] += 1\n",
+    "            self.model.bias[pred] -= 1\n",
+    "            self._acc.bias[gold] += self._counter\n",
+    "            self._acc.bias[pred] -= self._counter\n",
+    "            for f, v in features.items():\n",
+    "                self.model.weight[gold][f] += v\n",
+    "                self.model.weight[pred][f] -= v\n",
+    "                self._acc.weight[gold][f] += v * self._counter\n",
+    "                self._acc.weight[pred][f] -= v * self._counter\n",
+    "        self._counter += 1\n",
+    "\n",
+    "    def finalize(self):\n",
+    "        for c in self.model.classes:\n",
+    "            delta_b = self._acc.bias[c] / self._counter\n",
+    "            self.model.bias[c] -= delta_b\n",
+    "            for feat in self.model.weight[c]:\n",
+    "                delta_w = self._acc.weight[c][feat] / self._counter\n",
+    "                self.model.weight[c][feat] -= delta_w"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Perceptron tagger"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This is the part of the code that you will have to modify."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class PerceptronTagger(object):\n",
+    "\n",
+    "    def __init__(self, tags):\n",
+    "        self.model = Linear(tags)\n",
+    "\n",
+    "    def featurize(self, words, i, pred_tags):\n",
+    "        # TODO: This is the only method that you are allowed to change!\n",
+    "        feats = []\n",
+    "        feats.append(words[i])\n",
+    "        feats.append(words[i-1] if i > 0 else '<bos>')\n",
+    "        feats.append(words[i+1] if i + 1 < len(words) else '<eos>')\n",
+    "        feats.append(pred_tags[i-1] if i > 0 else '<bos>')\n",
+    "        return {(i, f): 1 for i, f in enumerate(feats)}\n",
+    "\n",
+    "    def predict(self, words):\n",
+    "        pred_tags = []\n",
+    "        for i, _ in enumerate(words):\n",
+    "            features = self.featurize(words, i, pred_tags)\n",
+    "            scores = self.model.forward(features)\n",
+    "            pred_tag = max(self.model.classes, key=lambda c: scores[c])\n",
+    "            pred_tags.append(pred_tag)\n",
+    "        return pred_tags"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Training loop"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from tqdm import tqdm\n",
+    "\n",
+    "def train_perceptron(train_data, n_epochs=1):\n",
+    "    # Collect the tags in the training data\n",
+    "    tags = set()\n",
+    "    for tagged_sentence in train_data:\n",
+    "        words, gold_tags = zip(*tagged_sentence)\n",
+    "        tags.update(gold_tags)\n",
+    "\n",
+    "    # Initialise and train the perceptron tagger\n",
+    "    tagger = PerceptronTagger(tags)\n",
+    "    trainer = PerceptronTrainer(tagger.model)\n",
+    "    for epoch in range(n_epochs):\n",
+    "        with tqdm(total=sum(1 for s in train_data)) as pbar:\n",
+    "            for tagged_sentence in train_data:\n",
+    "                words, gold_tags = zip(*tagged_sentence)\n",
+    "                pred_tags = []\n",
+    "                for i, gold_tag in enumerate(gold_tags):\n",
+    "                    features = tagger.featurize(words, i, pred_tags)\n",
+    "                    trainer.update(features, gold_tag)\n",
+    "                    pred_tags.append(gold_tag)\n",
+    "                pbar.update()\n",
+    "    trainer.finalize()\n",
+    "\n",
+    "    return tagger"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Problem 1: Evaluation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Your first task is to implement a function that computes the accuracy of the tagger on gold-standard data. You have already implemented this function for the base lab, so you should be able to just copy-and-paste it here."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def accuracy(tagger, gold_data):\n",
+    "    # TODO: Replace the next line with your own code\n",
+    "    return 0.0"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Problem 2: Feature engineering"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Your main task now is to try to improve the performance of the perceptron tagger by adding new features. The only part of the code that you are allowed to change is the `featurize` method. Provide a short (ca. 150&nbsp;words) report on what features you added and what results you obtained."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**⚠️ Your submitted notebook must contain output demonstrating at least 91% accuracy on the development set.**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tagger = train_perceptron(train_data, n_epochs=3)\n",
+    "print('{:.4f}'.format(accuracy(tagger, dev_data)))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "*TODO: Insert your report here*"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Chocolate Box Challenge"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To participate in the [Chocolate Box Challenge](https://www.kaggle.com/t/c8d8a804a4d24022af84cc22563682d6), run the next code cell to produce a file `submission.csv` and upload this file to Kaggle."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Load the test data (without the tags)\n",
+    "test_data = Dataset('test-notags.txt')\n",
+    "\n",
+    "# Generate submission.csv with results on both the dev data and the test data\n",
+    "with open('submission.csv', 'w') as target:\n",
+    "    target.write('Id,Tag\\n')\n",
+    "    for p, data in [('D', dev_data), ('T', test_data)]:\n",
+    "        for i, tagged_sentence in enumerate(data):\n",
+    "            words, _ = zip(*tagged_sentence)\n",
+    "            predicted_tags = tagger.predict(words)\n",
+    "            for j, tag in enumerate(predicted_tags):\n",
+    "                target.write('{}-{:04d}-{:04d},{}\\n'.format(p, i, j, tag))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Please observe the following rules for the Chocolate Box Challenge:\n",
+    "\n",
+    "> The point of the challenge is to come up with interesting features. You are not allowed to change the tagger in any other way. The maximal number of epochs that you may train your tagger with is&nbsp;5.\n",
+    "\n",
+    "Good luck, and may the best team win! 🙂"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.0"
+  },
+  "latex_envs": {
+   "bibliofile": "biblio.bib",
+   "cite_by": "apalike",
+   "current_citInitial": 1,
+   "eqLabelWithNumbers": true,
+   "eqNumInitial": 0
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
+%% Cell type:markdown id: tags:
+
+# L3X: Feature engineering for part-of-speech tagging
+
+%% Cell type:markdown id: tags:
+
+In this lab, you will practice your skills in feature engineering, the task of identifying useful features for a machine learning system.
+
+%% Cell type:markdown id: tags:
+
+## The data set
+
+%% Cell type:markdown id: tags:
+
+The data for this lab and their representation is the same as for the basic lab.
+
+%% Cell type:code id: tags:
+
+``` python
+class Dataset():
+
+    def __init__(self, filename):
+        self.filename = filename
+
+    def __iter__(self):
+        tmp = []
+        with open(self.filename, 'rt', encoding='utf-8') as lines:
+            for line in lines:
+                line = line.rstrip()
+                if line:
+                    tmp.append(tuple(line.split('\t')))
+                else:
+                    yield tmp
+                    tmp = []
+```
+
+%% Cell type:markdown id: tags:
+
+We load the training data and the development data for this lab:
+
+%% Cell type:code id: tags:
+
+``` python
+train_data = Dataset('train.txt')
+dev_data = Dataset('dev.txt')
+```
+
+%% Cell type:markdown id: tags:
+
+## Baseline tagger
+
+%% Cell type:markdown id: tags:
+
+The baseline tagger that you will use in this lab is a pure Python implementation of the perceptron tagger that was presented in Lecture&nbsp;3.3 and Lecture&nbsp;3.4. To understand what the code provided here does, and how it might be extended with new features, you should watch these two lectures.
+
+%% Cell type:markdown id: tags:
+
+### Linear model
+
+%% Cell type:code id: tags:
+
+``` python
+from collections import defaultdict
+
+class Linear(object):
+
+    def __init__(self, classes):
+        self.classes = sorted(classes)
+        self.weight = {c: defaultdict(float) for c in self.classes}
+        self.bias = {c: 0.0 for c in self.classes}
+
+    def forward(self, features):
+        scores = {}
+        for c in self.classes:
+            scores[c] = self.bias[c]
+            for f, v in features.items():
+                scores[c] += v * self.weight[c][f]
+        return scores
+```
+
+%% Cell type:markdown id: tags:
+
+### Perceptron learning algorithm
+
+%% Cell type:code id: tags:
+
+``` python
+class PerceptronTrainer(object):
+
+    def __init__(self, model):
+        self.model = model
+        self._acc = Linear(model.classes)
+        self._counter = 1
+
+    def update(self, features, gold):
+        scores = self.model.forward(features)
+        pred = max(self.model.classes, key=lambda c: scores[c])
+        if pred != gold:
+            self.model.bias[gold] += 1
+            self.model.bias[pred] -= 1
+            self._acc.bias[gold] += self._counter
+            self._acc.bias[pred] -= self._counter
+            for f, v in features.items():
+                self.model.weight[gold][f] += v
+                self.model.weight[pred][f] -= v
+                self._acc.weight[gold][f] += v * self._counter
+                self._acc.weight[pred][f] -= v * self._counter
+        self._counter += 1
+
+    def finalize(self):
+        for c in self.model.classes:
+            delta_b = self._acc.bias[c] / self._counter
+            self.model.bias[c] -= delta_b
+            for feat in self.model.weight[c]:
+                delta_w = self._acc.weight[c][feat] / self._counter
+                self.model.weight[c][feat] -= delta_w
+```
+
+%% Cell type:markdown id: tags:
+
+### Perceptron tagger
+
+%% Cell type:markdown id: tags:
+
+This is the part of the code that you will have to modify.
+
+%% Cell type:code id: tags:
+
+``` python
+class PerceptronTagger(object):
+
+    def __init__(self, tags):
+        self.model = Linear(tags)
+
+    def featurize(self, words, i, pred_tags):
+        # TODO: This is the only method that you are allowed to change!
+        feats = []
+        feats.append(words[i])
+        feats.append(words[i-1] if i > 0 else '<bos>')
+        feats.append(words[i+1] if i + 1 < len(words) else '<eos>')
+        feats.append(pred_tags[i-1] if i > 0 else '<bos>')
+        return {(i, f): 1 for i, f in enumerate(feats)}
+
+    def predict(self, words):
+        pred_tags = []
+        for i, _ in enumerate(words):
+            features = self.featurize(words, i, pred_tags)
+            scores = self.model.forward(features)
+            pred_tag = max(self.model.classes, key=lambda c: scores[c])
+            pred_tags.append(pred_tag)
+        return pred_tags
+```
+
+%% Cell type:markdown id: tags:
+
+### Training loop
+
+%% Cell type:code id: tags:
+
+``` python
+from tqdm import tqdm
+
+def train_perceptron(train_data, n_epochs=1):
+    # Collect the tags in the training data
+    tags = set()
+    for tagged_sentence in train_data:
+        words, gold_tags = zip(*tagged_sentence)
+        tags.update(gold_tags)
+
+    # Initialise and train the perceptron tagger
+    tagger = PerceptronTagger(tags)
+    trainer = PerceptronTrainer(tagger.model)
+    for epoch in range(n_epochs):
+        with tqdm(total=sum(1 for s in train_data)) as pbar:
+            for tagged_sentence in train_data:
+                words, gold_tags = zip(*tagged_sentence)
+                pred_tags = []
+                for i, gold_tag in enumerate(gold_tags):
+                    features = tagger.featurize(words, i, pred_tags)
+                    trainer.update(features, gold_tag)
+                    pred_tags.append(gold_tag)
+                pbar.update()
+    trainer.finalize()
+
+    return tagger
+```
+
+%% Cell type:markdown id: tags:
+
+## Problem 1: Evaluation
+
+%% Cell type:markdown id: tags:
+
+Your first task is to implement a function that computes the accuracy of the tagger on gold-standard data. You have already implemented this function for the base lab, so you should be able to just copy-and-paste it here.
+
+%% Cell type:code id: tags:
+
+``` python
+def accuracy(tagger, gold_data):
+    # TODO: Replace the next line with your own code
+    return 0.0
+```
+
+%% Cell type:markdown id: tags:
+
+## Problem 2: Feature engineering
+
+%% Cell type:markdown id: tags:
+
+Your main task now is to try to improve the performance of the perceptron tagger by adding new features. The only part of the code that you are allowed to change is the `featurize` method. Provide a short (ca. 150&nbsp;words) report on what features you added and what results you obtained.
+
+%% Cell type:markdown id: tags:
+
+**⚠️ Your submitted notebook must contain output demonstrating at least 91% accuracy on the development set.**
+
+%% Cell type:code id: tags:
+
+``` python
+tagger = train_perceptron(train_data, n_epochs=3)
+print('{:.4f}'.format(accuracy(tagger, dev_data)))
+```
+
+%% Cell type:markdown id: tags:
+
+*TODO: Insert your report here*
+
+%% Cell type:markdown id: tags:
+
+## Chocolate Box Challenge
+
+%% Cell type:markdown id: tags:
+
+To participate in the [Chocolate Box Challenge](https://www.kaggle.com/t/c8d8a804a4d24022af84cc22563682d6), run the next code cell to produce a file `submission.csv` and upload this file to Kaggle.
+
+%% Cell type:code id: tags:
+
+``` python
+# Load the test data (without the tags)
+test_data = Dataset('test-notags.txt')
+
+# Generate submission.csv with results on both the dev data and the test data
+with open('submission.csv', 'w') as target:
+    target.write('Id,Tag\n')
+    for p, data in [('D', dev_data), ('T', test_data)]:
+        for i, tagged_sentence in enumerate(data):
+            words, _ = zip(*tagged_sentence)
+            predicted_tags = tagger.predict(words)
+            for j, tag in enumerate(predicted_tags):
+                target.write('{}-{:04d}-{:04d},{}\n'.format(p, i, j, tag))
+```
+
+%% Cell type:markdown id: tags:
+
+Please observe the following rules for the Chocolate Box Challenge:
+
+> The point of the challenge is to come up with interesting features. You are not allowed to change the tagger in any other way. The maximal number of epochs that you may train your tagger with is&nbsp;5.
+
+Good luck, and may the best team win! 🙂
--- a/labs/l3x/dev.txt
+++ b/labs/l3x/dev.txt
--- a/labs/l3x/test-notags.txt
+++ b/labs/l3x/test-notags.txt
--- a/labs/l3x/train.txt
+++ b/labs/l3x/train.txt