Commit 3063a292 authored by Umamaheswarababu Maddela
Upload New File
parent 0be5bd06
%% Cell type:markdown id:17cab251 tags:
<div class="alert alert-info">
➡️ Make sure that you have read the **[rules for hand-in assignments](https://www.ida.liu.se/~TDDE16/exam.en.shtml#handins)** and the **[policy on cheating and plagiarism](https://www.ida.liu.se/~TDDE16/exam.en.shtml#cheating)** before starting with this lab.
➡️ Make sure you fill in any cells (and _only_ those cells) that say **`YOUR CODE HERE`** or **`YOUR ANSWER HERE`**, and do _not_ modify any of the other cells.
➡️ **Before you submit your lab, make sure everything runs as expected.** For this, _restart the kernel_ and _run all cells_ from top to bottom. In Jupyter Notebook version 7 or higher, you can do this via "Run$\rightarrow$Restart Kernel and Run All Cells..." in the menu (or the "⏩" button in the toolbar).
</div>
%% Cell type:markdown id:91c4f84c-26c4-4bcb-81d1-8757088f0623 tags:
# L1: Information Retrieval
In this lab you will apply basic techniques from information retrieval to implement the core of a minimalistic search engine. The data for this lab consists of a collection of app descriptions scraped from the [Google Play Store](https://play.google.com/store/apps?hl=en). From this collection, your search engine should retrieve those apps whose descriptions best match a given query under the vector space model.
%% Cell type:code id:2c92aa93-cf15-4e1c-975e-fea9bbe0b0c4 tags:
``` python
# Define some helper functions that are used in this notebook
from IPython.display import display, HTML
def success():
    display(HTML('<div class="alert alert-success"><strong>Solution appears correct!</strong></div>'))
```
%% Cell type:markdown id:d2b5345b-0f8f-4a58-b7d3-bd5baae0c281 tags:
## Data set
%% Cell type:markdown id:68a82b5c-1660-4389-942b-f15420289549 tags:
The app descriptions come in the form of a compressed [JSON](https://en.wikipedia.org/wiki/JSON) file. Start by loading this file into a [Pandas](https://pandas.pydata.org) [DataFrame](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe).
%% Cell type:code id:fd4982e3-3df4-4837-97b5-4c300b0d4a20 tags:
``` python
import bz2
import numpy as np
import pandas as pd
with bz2.open('app-descriptions.json.bz2') as source:
    df = pd.read_json(source)
```
%% Cell type:markdown id:30823068-3102-430d-9989-b4f530756a08 tags:
In Pandas, a DataFrame is a table with indexed rows and labelled columns of potentially different types. You can access data in a DataFrame in various ways, including by row and column. To give an example, the code in the next cell shows rows 200–204:
%% Cell type:code id:9bb955a7-68d4-4e01-b0bd-16ae110af477 tags:
``` python
df[200:205]
```
%% Output
name \
200 Brick Breaker Star: Space King
201 Brick Classic - Brick Game
202 Bricks Breaker - Glow Balls
203 Bricks Breaker Quest
204 Brothers in Arms® 3
description
200 Introducing the best Brick Breaker game that e...
201 Classic Brick Game!\n\nBrick Classic is a popu...
202 Bricks Breaker - Glow Balls is a addictive and...
203 How to play\n- The ball flies to wherever you ...
204 Fight brave soldiers from around the globe on ...
%% Cell type:markdown id:27186bc1-de90-4125-837a-7b4e8671d276 tags:
As you can see, there are two labelled columns: `name` (the name of the app) and `description` (a textual description). The code in the next cell shows how to access fields from the description column.
%% Cell type:code id:595bb7af-5ee4-4b8e-8f5e-df5aaae688d9 tags:
``` python
df['description'][200:205]
```
%% Output
200 Introducing the best Brick Breaker game that e...
201 Classic Brick Game!\n\nBrick Classic is a popu...
202 Bricks Breaker - Glow Balls is a addictive and...
203 How to play\n- The ball flies to wherever you ...
204 Fight brave soldiers from around the globe on ...
Name: description, dtype: object
%% Cell type:markdown id:35e9d0cb-a8bb-4e94-9b8c-b6b33651ac8e tags:
## Problem 1: Preprocessing
%% Cell type:markdown id:d81a9865-f314-4d80-9fac-e6544413ee3a tags:
Your first task is to implement a preprocessor for your search engine. In the vector space model, *preprocessing* refers to any transformation applied to a text before vectorisation. Here you can restrict yourself to a simple type of preprocessing: tokenisation, stop word removal, and lemmatisation.
To implement your preprocessor, you can use [spaCy](https://spacy.io). Make sure to read the [Linguistic annotations](https://spacy.io/usage/spacy-101#annotations) section of the spaCy&nbsp;101; that section contains all the information you need for this problem (and more).
Implement your preprocessor by completing the skeleton code in the next cell, adding additional code as you deem necessary.
%% Cell type:code id:1a2a6fc6-dee8-4140-bf7a-1a245e60a1b3 tags:
``` python
import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner', 'textcat'])
def preprocess(text):
    """Preprocess the given text by tokenising it, removing any stop words,
    replacing each remaining token with its lemma (base form), and discarding
    all lemmas that contain non-alphabetical characters.

    Arguments:
        text (str): The text to preprocess.

    Returns:
        The list of remaining lemmas after preprocessing (represented as strings).
    """
    # Tokenise the text and remove stop words
    doc = nlp(text)
    tokens = [token for token in doc if not token.is_stop]
    # Lemmatise and keep only purely alphabetical lemmas
    lemmas = [token.lemma_ for token in tokens]
    return [lemma for lemma in lemmas if lemma.isalpha()]
```
%% Cell type:markdown id:8af08b57-8e0c-4526-b7fc-94c8bace1936 tags:
### 🤞 Test your code
Test your implementation by running the following cell:
%% Cell type:code id:e918affb-f8aa-4cbf-9cbe-b03b9251e941 tags:
``` python
"""Check that the preprocessing returns the correct output for a number of test cases."""
assert (
preprocess('Apple is looking at buying U.K. startup for $1 billion') ==
['Apple', 'look', 'buy', 'startup', 'billion']
)
assert (
preprocess('"Love Story" is a country pop song written and sung by Taylor Swift.') ==
['Love', 'Story', 'country', 'pop', 'song', 'write', 'sing', 'Taylor', 'Swift']
)
success()
```
%% Output
%% Cell type:markdown id:8e2852ff-b0ca-45b1-b904-530a1ae3494f tags:
## Problem 2: Vectorising
%% Cell type:markdown id:97a37f90-e7d8-4542-89e0-bea664601769 tags:
Your next task is to vectorise the data – and more specifically, to map each app description to a tf–idf vector. For this you can use the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class from [scikit-learn](https://scikit-learn.org/stable/). Make sure to specify your preprocessor from the previous problem as the `tokenizer` &ndash; not the `preprocessor`! &ndash; for the vectoriser. (In scikit-learn terminology, the `preprocessor` handles string-level preprocessing.)
After running the following cell:
- `vectorizer` should contain the vectorizer fitted on `df['description']`
- `X` should contain the vectorized `df['description']`
%% Cell type:code id:0ca674d8-c2df-4c8f-bb3d-6d59bdc401fb tags:
``` python
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(tokenizer=preprocess)
X = vectorizer.fit_transform(df['description'])
```
%% Output
C:\Users\Dell\miniconda3\envs\liu-text-mining\Lib\site-packages\sklearn\feature_extraction\text.py:525: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
warnings.warn(
%% Cell type:markdown id:12c88775-0daf-4c6a-8bc0-9a101bb1fa45 tags:
### 🤞 Test your code
Test your implementation by running the following cell:
%% Cell type:code id:edb03bab-edee-4b5d-98e3-c0fdb36ae507 tags:
``` python
"""Check that the dimensions of X are as expected."""
print(f"The dimensions of X are: {X.shape}")
assert X.shape[0] == 1614
assert 21200 < X.shape[1] < 21500
success()
```
%% Output
The dimensions of X are: (1614, 21356)
%% Cell type:markdown id:7f93ad72-b189-4ca4-80c0-c6f15eebe1e0 tags:
The dimensions of `X` should be around 1614$\times$21356; the number of rows should be _exactly_ 1614, while the number of columns may differ from the one given here depending on the versions of spaCy and of the language model used, as well as on the preprocessing.
%% Cell type:markdown id:a32e7b26-3b4f-4198-89f0-37bf54086827 tags:
## Problem 3: Retrieving
%% Cell type:markdown id:2413f786-8a38-4321-85d7-35e571f97aba tags:
To complete the search engine, your last task is to write a function that returns the most relevant app descriptions for a given query. An easy way to solve this task is to use scikit-learn&rsquo;s [NearestNeighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html) class. That class implements unsupervised nearest neighbours learning and allows you to easily find a predefined number of app descriptions whose vector representations are closest to the query vector.
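%% Cell type:markdown tags:
One detail worth noting: `NearestNeighbors` uses the Euclidean (Minkowski, $p=2$) metric by default, while this lab asks for cosine similarity. Because `TfidfVectorizer` L2-normalises its output rows by default (`norm='l2'`), the two rankings coincide; alternatively, you can pass `metric='cosine'` when instantiating the class. A minimal numeric check of this identity (the two vectors below are made up for illustration):
%% Cell type:code tags:
``` python
import numpy as np

# For unit-length vectors a and b: ||a - b||^2 = 2 - 2 * cos(a, b),
# so ranking neighbours by Euclidean distance equals ranking by cosine similarity.
a = np.array([1.0, 0.0])
b = np.array([0.6, 0.8])  # unit length, since 0.36 + 0.64 = 1
cos_sim = float(a @ b)
eucl_sq = float(np.sum((a - b) ** 2))
print(eucl_sq, 2 - 2 * cos_sim)  # both evaluate to 0.8
```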
First, instantiate and fit a class that returns the _ten (10)_ nearest neighbors:
%% Cell type:code id:80145071-fe5b-4366-87d0-99c39567a736 tags:
``` python
"""Instantiate and fit a class that returns the 10 nearest neighbors."""
# YOUR CODE HERE
from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=10)
neigh.fit(X)
```
%% Output
NearestNeighbors(n_neighbors=10)
%% Cell type:markdown id:5c79b405-a236-4caf-a140-dff01e0ee88b tags:
Second, implement a function that uses the fitted class to find the nearest neighbours for a given query string:
%% Cell type:code id:d4efd824-0122-40de-9fc6-efd071be906a tags:
``` python
def search(query):
    """Find the nearest neighbours in `df` for a query string.

    Arguments:
        query (str): A query string.

    Returns:
        The 10 apps (with name and description) most similar (in terms of
        cosine similarity) to the given query as a Pandas DataFrame.
    """
    # Vectorise the query using the fitted vectoriser
    query_vec = vectorizer.transform([query])
    # Find the indices of the 10 nearest neighbours of the query vector
    near_indices = neigh.kneighbors(query_vec, return_distance=False)
    # Retrieve the corresponding app names and descriptions
    return df.iloc[near_indices[0]]
```
%% Cell type:markdown id:e2cbb9f7-1cf5-49b9-a364-7b23fb6bbd1d tags:
### 🤞 Test your code
Test your implementation by running the following cell, which will show the 10 best search results for the query _"dodge trains"_:
%% Cell type:code id:3d30a52e-bdac-412b-b9c4-8e9cc17436a6 tags:
``` python
"""Check that searching for "dodge trains" returns a DataFrame with ten results,
and that the top result is "Subway Surfers"."""
result = search('dodge trains')
display(result)
assert isinstance(result, pd.DataFrame), "Search results should be a Pandas DataFrame"
assert len(result) == 10, "Should return 10 search results"
assert result.iloc[0]["name"] == "Subway Surfers", "Top search result should be 'Subway Surfers'"
success()
```
%% Output
%% Cell type:markdown id:2f1dcaf2-2b9b-4be8-8c25-40c13d8cc4ef tags:
The top hit in the list should be _Subway Surfers_.
%% Cell type:markdown id:af8229ea-3424-496a-84dc-76df5e41319c tags:
## Problem 4: Finding terms with low/high idf
%% Cell type:markdown id:0f689e35-9d58-4a5a-8719-3c8580ba1fd5 tags:
Recall that the inverse document frequency (idf) of a term decreases as the number of documents in a given collection that contain the term increases. To get a better understanding of this concept, your next task is to write code to find out which terms from the app descriptions have the lowest/highest idf.
Start by sorting the terms in _increasing_ order of idf, breaking ties by falling back on alphabetic order, and store the result in the variable `terms`.
%% Cell type:code id:5c9df0a4-2c15-4d2e-ad21-f121f39a4c72 tags:
``` python
# YOUR CODE HERE
# A list of (term, idf) pairs
term_idf_pairs = list(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))
# Sort the terms primarily in increasing order of idf, secondarily in alphabetical order
term_idf_pairs_sorted = sorted(term_idf_pairs, key=lambda x: (x[1], x[0]))
# Store the sorted terms in the variable `terms`
terms = [term for term, _ in term_idf_pairs_sorted]
```
%% Cell type:markdown id:a7a2b1dd-82fd-49ae-8771-c4ecd58bbe64 tags:
The following cell prints the 10 terms with the lowest/highest idf, which you can use to check if your results appear correct:
%% Cell type:code id:56f80f09-da92-4384-9989-42e650416d91 tags:
``` python
"""Print first 10/last 10 terms."""
print(f"Terms with the lowest idf:\n{terms[:10]}\n")
print(f"Terms with the highest idf:\n{terms[-10:]}")
```
%% Output
Terms with the lowest idf:
['game', 'play', 'feature', 'free', 'new', 'world', 'time', 'app', 'fun', 'use']
Terms with the highest idf:
['회원가입에', '회원을', '획득한', '효과', '효과음', 'find', 'finger', 'finish', 'first', 'flye']
%% Cell type:markdown id:a099ea52-599e-4c01-99d9-e7388d2c5be8 tags:
## Problem 5: Keyword extraction
%% Cell type:markdown id:d2aacd2e-3de9-49fe-aab2-2e8b5577ac1b tags:
We often want to extract salient keywords from a document. A simple method is to pick the $k$ terms with the highest tf–idf value. Your last task in this lab is to implement this method. More specifically, we ask you to implement a function `keywords` that extracts keywords from a text.
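%% Cell type:markdown tags:
The "pick the $k$ largest" step itself is independent of the lab data; assuming a toy score vector and vocabulary (the names below are illustrative only), it can be sketched with NumPy's `argsort`:
%% Cell type:code tags:
``` python
import numpy as np

# Toy tf-idf row and matching vocabulary (illustrative names)
scores = np.array([0.1, 0.7, 0.0, 0.4])
vocab = np.array(['alpha', 'beta', 'gamma', 'delta'])

# argsort sorts ascending; reverse it and take the first k indices
k = 2
top = list(vocab[np.argsort(scores)[::-1][:k]])
print(top)  # ['beta', 'delta']
```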
%% Cell type:code id:521a307f-eeab-48ec-b139-d5111e2b2ba8 tags:
``` python
def keywords(text, n=10):
    """Extract the most salient keywords from a text.

    Arguments:
        text (str): The text from which to extract keywords.
        n (int): The number of keywords to extract. [default: 10]

    Returns:
        A list containing the `n` most salient keywords from `text`, as measured by
        their tf–idf value relative to the collection of app descriptions.
    """
    # Vectorise the text (tf-idf) using the fitted vectoriser
    text_vec = vectorizer.transform([text])
    # All terms in the vectoriser's vocabulary
    all_terms = vectorizer.get_feature_names_out()
    # Indices of the terms, sorted by decreasing tf-idf value
    sort_indices = text_vec.toarray()[0].argsort()[::-1]
    # Return the n terms with the highest tf-idf values
    return [all_terms[i] for i in sort_indices[:n]]
```
%% Cell type:markdown id:07a4a6b3-78dd-4ece-9c56-fa73c6319ec0 tags:
### 🤞 Test your code
Test your implementation by running the following cell:
%% Cell type:code id:eb0344ed-f785-41c0-b74b-80208f148ab8 tags:
``` python
"""Check that the most salient keywords from the description of 'Train Conductor World'
overlap substantially with the expected list of keywords."""
out = keywords(df['description'][1428])
print(out)
assert len(out) == 10
assert len(
set(out) & set(['train', 'railway', 'railroad', 'rail', 'chaos', 'crash', 'timetable', 'overcast', 'haul', 'tram'])
) >= 6, "Keywords for df['description'][1428] do not overlap substantially with the expected result"
success()
```
%% Output
['train', 'railway', 'railroad', 'rail', 'chaos', 'crash', 'locomotive', 'overcast', 'timetable', 'tram']
%% Cell type:markdown id:e6ff1084-839a-4fbb-9f97-f1db934647bc tags:
The cell above prints the most salient keywords from the description of the app "Train Conductor World". The exact output may differ slightly depending on the strategy used to break ties, so the cell only checks if there is a sufficient overlap.
%% Cell type:markdown id:125ccdbd-4375-4d2f-8b1d-f47097ef2e84 tags:
**Congratulations on finishing this lab! 👍**
<div class="alert alert-info">
➡️ Don't forget to **test that everything runs as expected** before you submit!
</div>