"➡️ Make sure that you have read the **[rules for hand-in assignments](https://www.ida.liu.se/~TDDE16/exam.en.shtml#handins)** and the **[policy on cheating and plagiarism](https://www.ida.liu.se/~TDDE16/exam.en.shtml#cheating)** before starting with this lab.\n",
"\n",
"➡️ Make sure you fill in any cells (and _only_ those cells) that say **`YOUR CODE HERE`** or **YOUR ANSWER HERE**, and do _not_ modify any of the other cells.\n",
"\n",
"➡️ **Before you submit your lab, make sure everything runs as expected.** For this, _restart the kernel_ and _run all cells_ from top to bottom. In Jupyter Notebook version 7 or higher, you can do this via \"Run$\\rightarrow$Restart Kernel and Run All Cells...\" in the menu (or the \"⏩\" button in the toolbar).\n",
"\n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "91c4f84c-26c4-4bcb-81d1-8757088f0623",
"metadata": {},
"source": [
"# L1: Information Retrieval\n",
"\n",
"In this lab you will apply basic techniques from information retrieval to implement the core of a minimalistic search engine. The data for this lab consists of a collection of app descriptions scraped from the [Google Play Store](https://play.google.com/store/apps?hl=en). From this collection, your search engine should retrieve those apps whose descriptions best match a given query under the vector space model."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "2c92aa93-cf15-4e1c-975e-fea9bbe0b0c4",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "8f9dc2c0bcae45f12b202be2222aefcd",
"grade": false,
"grade_id": "cell-f766ed4c371f7a04",
"locked": true,
"schema_version": 3,
"solution": false,
"task": false
}
},
"outputs": [],
"source": [
"# Define some helper functions that are used in this notebook\n",
"The app descriptions come in the form of a compressed [JSON](https://en.wikipedia.org/wiki/JSON) file. Start by loading this file into a [Pandas](https://pandas.pydata.org) [DataFrame](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "fd4982e3-3df4-4837-97b5-4c300b0d4a20",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "85522a5eda16cc395ff1cfd712adf93d",
"grade": false,
"grade_id": "cell-c5ac0bec64889197",
"locked": true,
"schema_version": 3,
"solution": false,
"task": false
}
},
"outputs": [],
"source": [
"import bz2\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"with bz2.open('app-descriptions.json.bz2') as source:\n",
" df = pd.read_json(source)"
]
},
{
"cell_type": "markdown",
"id": "30823068-3102-430d-9989-b4f530756a08",
"metadata": {},
"source": [
"In Pandas, a DataFrame is a table with indexed rows and labelled columns of potentially different types. You can access data in a DataFrame in various ways, including by row and column. To give an example, the code in the next cell shows rows 200–204:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "9bb955a7-68d4-4e01-b0bd-16ae110af477",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>description</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>200</th>\n",
" <td>Brick Breaker Star: Space King</td>\n",
" <td>Introducing the best Brick Breaker game that e...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>201</th>\n",
" <td>Brick Classic - Brick Game</td>\n",
" <td>Classic Brick Game!\\n\\nBrick Classic is a popu...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>202</th>\n",
" <td>Bricks Breaker - Glow Balls</td>\n",
" <td>Bricks Breaker - Glow Balls is a addictive and...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>203</th>\n",
" <td>Bricks Breaker Quest</td>\n",
" <td>How to play\\n- The ball flies to wherever you ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>204</th>\n",
" <td>Brothers in Arms® 3</td>\n",
" <td>Fight brave soldiers from around the globe on ...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name \\\n",
"200 Brick Breaker Star: Space King \n",
"201 Brick Classic - Brick Game \n",
"202 Bricks Breaker - Glow Balls \n",
"203 Bricks Breaker Quest \n",
"204 Brothers in Arms® 3 \n",
"\n",
" description \n",
"200 Introducing the best Brick Breaker game that e... \n",
"201 Classic Brick Game!\\n\\nBrick Classic is a popu... \n",
"202 Bricks Breaker - Glow Balls is a addictive and... \n",
"203 How to play\\n- The ball flies to wherever you ... \n",
"204 Fight brave soldiers from around the globe on ... "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[200:205]"
]
},
{
"cell_type": "markdown",
"id": "27186bc1-de90-4125-837a-7b4e8671d276",
"metadata": {},
"source": [
"As you can see, there are two labelled columns: `name` (the name of the app) and `description` (a textual description). The code in the next cell shows how to access fields from the description column."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "595bb7af-5ee4-4b8e-8f5e-df5aaae688d9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"200 Introducing the best Brick Breaker game that e...\n",
"201 Classic Brick Game!\\n\\nBrick Classic is a popu...\n",
"202 Bricks Breaker - Glow Balls is a addictive and...\n",
"203 How to play\\n- The ball flies to wherever you ...\n",
"204 Fight brave soldiers from around the globe on ...\n",
"Name: description, dtype: object"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['description'][200:205]"
]
},
{
"cell_type": "markdown",
"id": "35e9d0cb-a8bb-4e94-9b8c-b6b33651ac8e",
"metadata": {},
"source": [
"## Problem 1: Preprocessing"
]
},
{
"cell_type": "markdown",
"id": "d81a9865-f314-4d80-9fac-e6544413ee3a",
"metadata": {},
"source": [
"Your first task is to implement a preprocessor for your search engine. In the vector space model, *preprocessing* refers to any transformation applied to a text before vectorisation. Here you can restrict yourself to a simple type of preprocessing: tokenisation, stop word removal, and lemmatisation.\n",
"\n",
"To implement your preprocessor, you can use [spaCy](https://spacy.io). Make sure to read the [Linguistic annotations](https://spacy.io/usage/spacy-101#annotations) section of the spaCy 101; that section contains all the information you need for this problem (and more).\n",
"\n",
"Implement your preprocessor by completing the skeleton code in the next cell, adding additional code as you deem necessary."
"Your next task is to vectorise the data – and more specifically, to map each app description to a tf–idf vector. For this you can use the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class from [scikit-learn](https://scikit-learn.org/stable/). Make sure to specify your preprocessor from the previous problem as the `tokenizer` – not the `preprocessor`! – for the vectoriser. (In scikit-learn terminology, the `preprocessor` handles string-level preprocessing.)\n",
"\n",
"After running the following cell:\n",
"- `vectorizer` should contain the vectorizer fitted on `df['description']`\n",
"- `X` should contain the vectorized `df['description']`"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "0ca674d8-c2df-4c8f-bb3d-6d59bdc401fb",
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "f837f579b922fc7145fe30b2c26a1e10",
"grade": false,
"grade_id": "cell-eeff6351582552c5",
"locked": false,
"schema_version": 3,
"solution": true,
"task": false
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\Dell\\miniconda3\\envs\\liu-text-mining\\Lib\\site-packages\\sklearn\\feature_extraction\\text.py:525: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'\n",
"\"\"\"Check that the dimensions of X are as expected.\"\"\"\n",
"\n",
"print(f\"The dimensions of X are: {X.shape}\")\n",
"assert X.shape[0] == 1614\n",
"assert 21200 < X.shape[1] < 21500\n",
"\n",
"success()"
]
},
{
"cell_type": "markdown",
"id": "7f93ad72-b189-4ca4-80c0-c6f15eebe1e0",
"metadata": {},
"source": [
"The dimensions of `X` should be around 1614$\\times$21356; the number of rows should be _exactly_ 1614 , while the number of columns may differ from that given here depending on the version of spaCy and the version of the language model used, as well as the pre-processing."
]
},
{
"cell_type": "markdown",
"id": "a32e7b26-3b4f-4198-89f0-37bf54086827",
"metadata": {},
"source": [
"## Problem 3: Retrieving"
]
},
{
"cell_type": "markdown",
"id": "2413f786-8a38-4321-85d7-35e571f97aba",
"metadata": {},
"source": [
"To complete the search engine, your last task is to write a function that returns the most relevant app descriptions for a given query. An easy way to solve this task is to use scikit-learn’s [NearestNeighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html) class. That class implements unsupervised nearest neighbours learning and allows you to easily find a predefined number of app descriptions whose vector representations are closest to the query vector.\n",
"\n",
"First, instantiate and fit a class that returns the _ten (10)_ nearest neighbors:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "80145071-fe5b-4366-87d0-99c39567a736",
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "c6ef66ce57f93d46bc57283865423a40",
"grade": false,
"grade_id": "cell-f9aa465499d29b7c",
"locked": false,
"schema_version": 3,
"solution": true,
"task": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<style>#sk-container-id-1 {color: black;}#sk-container-id-1 pre{padding: 0;}#sk-container-id-1 div.sk-toggleable {background-color: white;}#sk-container-id-1 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-1 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-1 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-1 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-1 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-1 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-1 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-1 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-1 div.sk-label:hover label.sk-toggleable__label 
{background-color: #d4ebff;}#sk-container-id-1 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-1 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-1 div.sk-item {position: relative;z-index: 1;}#sk-container-id-1 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-1 div.sk-item::before, #sk-container-id-1 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-1 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-1 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-1 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-1 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-1 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-1 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-1 div.sk-label-container {text-align: center;}#sk-container-id-1 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. 
See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-1 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>NearestNeighbors(n_neighbors=10)</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-1\" type=\"checkbox\" checked><label for=\"sk-estimator-id-1\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">NearestNeighbors</label><div class=\"sk-toggleable__content\"><pre>NearestNeighbors(n_neighbors=10)</pre></div></div></div></div></div>"
],
"text/plain": [
"NearestNeighbors(n_neighbors=10)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"\"\"Instantiate and fit a class that returns the 10 nearest neighbors.\"\"\"\n",
"assert result.iloc[0][\"name\"] == \"Subway Surfers\", \"Top search result should be 'Subway Surfers'\"\n",
"success()"
]
},
{
"cell_type": "markdown",
"id": "2f1dcaf2-2b9b-4be8-8c25-40c13d8cc4ef",
"metadata": {},
"source": [
"The top hit in the list should be _Subway Surfers_."
]
},
{
"cell_type": "markdown",
"id": "af8229ea-3424-496a-84dc-76df5e41319c",
"metadata": {},
"source": [
"## Problem 4: Finding terms with low/high idf"
]
},
{
"cell_type": "markdown",
"id": "0f689e35-9d58-4a5a-8719-3c8580ba1fd5",
"metadata": {},
"source": [
"Recall that the inverse document frequency (idf) of a term is the lower, the more documents from a given collection the term appears in. To get a better understanding for this concept, your next task is to write code to find out which terms from the app descriptions have the lowest/highest idf.\n",
"\n",
"Start by sorting the terms in _increasing_ order of idf, breaking ties by falling back on alphabetic order, and store the result in the variable `terms`."
"print(f\"Terms with the lowest idf:\\n{terms[:10]}\\n\")\n",
"print(f\"Terms with the highest idf:\\n{terms[-10:]}\")"
]
},
{
"cell_type": "markdown",
"id": "a099ea52-599e-4c01-99d9-e7388d2c5be8",
"metadata": {},
"source": [
"## Problem 5: Keyword extraction"
]
},
{
"cell_type": "markdown",
"id": "d2aacd2e-3de9-49fe-aab2-2e8b5577ac1b",
"metadata": {},
"source": [
"We often want to extract salient keywords from a document. A simple method is to pick the $k$ terms with the highest tf–idf value. Your last task in this lab is to implement this method. More specifically, we ask you to implement a function `keywords` that extracts keywords from a text."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "521a307f-eeab-48ec-b139-d5111e2b2ba8",
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "cb727ddf70f84ef9387a1bca41f0dc89",
"grade": false,
"grade_id": "cell-9c62bb6b91bb0383",
"locked": false,
"schema_version": 3,
"solution": true,
"task": false
}
},
"outputs": [],
"source": [
"def keywords(text, n=10):\n",
" \"\"\"\n",
" Arguments:\n",
" text (str): The text from which to extract keywords.\n",
" n (int): The number of keywords to extract. [default: 10]\n",
"\n",
" Returns:\n",
" A list containing the `n` most salient keywords from `text`, as measured by\n",
" their tf–idf value relative to the collection of app descriptions.\n",
" \"\"\"\n",
" # YOUR CODE HERE\n",
"\n",
" # Vectorize (tf-idf) the text using the vectorizer implemented above\n",
") >= 6, \"Keywords for df['description'][1428] do not overlap substantially with the expected result\"\n",
"success()"
]
},
{
"cell_type": "markdown",
"id": "e6ff1084-839a-4fbb-9f97-f1db934647bc",
"metadata": {},
"source": [
"The cell above prints the most salient keywords from the description of the app \"Train Conductor World\". The exact output may differ slightly depending on the strategy used to break ties, so the cell only checks if there is a sufficient overlap."
]
},
{
"cell_type": "markdown",
"id": "125ccdbd-4375-4d2f-8b1d-f47097ef2e84",
"metadata": {},
"source": [
"**Congratulations on finishing this lab! 👍**\n",
"\n",
"<div class=\"alert alert-info\">\n",
" \n",
"➡️ Don't forget to **test that everything runs as expected** before you submit!\n",
"\n",
"</div>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
%% Cell type:markdown id:17cab251 tags:
<div class="alert alert-info">

➡️ Make sure that you have read the **[rules for hand-in assignments](https://www.ida.liu.se/~TDDE16/exam.en.shtml#handins)** and the **[policy on cheating and plagiarism](https://www.ida.liu.se/~TDDE16/exam.en.shtml#cheating)** before starting with this lab.

➡️ Make sure you fill in any cells (and _only_ those cells) that say **`YOUR CODE HERE`** or **YOUR ANSWER HERE**, and do _not_ modify any of the other cells.

➡️ **Before you submit your lab, make sure everything runs as expected.** For this, _restart the kernel_ and _run all cells_ from top to bottom. In Jupyter Notebook version 7 or higher, you can do this via "Run$\rightarrow$Restart Kernel and Run All Cells..." in the menu (or the "⏩" button in the toolbar).

</div>
In this lab you will apply basic techniques from information retrieval to implement the core of a minimalistic search engine. The data for this lab consists of a collection of app descriptions scraped from the [Google Play Store](https://play.google.com/store/apps?hl=en). From this collection, your search engine should retrieve those apps whose descriptions best match a given query under the vector space model.
The app descriptions come in the form of a compressed [JSON](https://en.wikipedia.org/wiki/JSON) file. Start by loading this file into a [Pandas](https://pandas.pydata.org) [DataFrame](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe).
In Pandas, a DataFrame is a table with indexed rows and labelled columns of potentially different types. You can access data in a DataFrame in various ways, including by row and column. To give an example, the code in the next cell shows rows 200–204:
As you can see, there are two labelled columns: `name` (the name of the app) and `description` (a textual description). The code in the next cell shows how to access fields from the description column.
Your first task is to implement a preprocessor for your search engine. In the vector space model, *preprocessing* refers to any transformation applied to a text before vectorisation. Here you can restrict yourself to a simple type of preprocessing: tokenisation, stop word removal, and lemmatisation.
To implement your preprocessor, you can use [spaCy](https://spacy.io). Make sure to read the [Linguistic annotations](https://spacy.io/usage/spacy-101#annotations) section of the spaCy 101; that section contains all the information you need for this problem (and more).
Implement your preprocessor by completing the skeleton code in the next cell, adding additional code as you deem necessary.
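The three steps named above (tokenisation, stop word removal, lemmatisation) can be illustrated without spaCy. The following is a library-free sketch, not the expected solution: the stop word set `STOP_WORDS`, the lemma table `LEMMAS`, and the function name `toy_preprocess` are hypothetical stand-ins invented for this illustration.

```python
import re

# Hypothetical stand-ins for the resources a real language pipeline provides.
STOP_WORDS = {"the", "a", "is", "and", "to", "of"}
LEMMAS = {"games": "game", "playing": "play", "bricks": "brick"}


def toy_preprocess(text):
    """Tokenise, drop stop words, and map each remaining token to a lemma."""
    # Tokenisation: lower-case the text and keep runs of ASCII letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stop word removal, then lemmatisation via the lookup table.
    return [LEMMAS.get(tok, tok) for tok in tokens if tok not in STOP_WORDS]


print(toy_preprocess("Playing the best bricks games!"))
```

With spaCy, the corresponding information is available on each token through attributes such as `lemma_`, `is_stop` and `is_alpha`, so no hand-written tables are needed.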
Your next task is to vectorise the data – and more specifically, to map each app description to a tf–idf vector. For this you can use the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class from [scikit-learn](https://scikit-learn.org/stable/). Make sure to specify your preprocessor from the previous problem as the `tokenizer` – not the `preprocessor`! – for the vectoriser. (In scikit-learn terminology, the `preprocessor` handles string-level preprocessing.)
After running the following cell:
- `vectorizer` should contain the vectorizer fitted on `df['description']`
- `X` should contain the vectorized `df['description']`
The dimensions of `X` should be around 1614$\times$21356; the number of rows should be _exactly_ 1614, while the number of columns may differ from that given here depending on the version of spaCy and the version of the language model used, as well as on your preprocessing.
To complete the search engine, the final step is to write a function that returns the most relevant app descriptions for a given query. An easy way to solve this task is to use scikit-learn’s [NearestNeighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html) class. That class implements unsupervised nearest neighbours learning and allows you to easily find a predefined number of app descriptions whose vector representations are closest to the query vector.
First, instantiate and fit a `NearestNeighbors` object that returns the _ten (10)_ nearest neighbors:
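To make the idea of "closest to the query vector" concrete, here is a brute-force sketch in plain Python. It is not the scikit-learn implementation: the function name `k_nearest` and the toy vectors are assumptions made for this illustration, and Euclidean distance is used as the distance measure.

```python
import math


def k_nearest(query, vectors, k=2):
    """Return the indices of the k vectors closest to `query` (Euclidean)."""
    # Pair each distance with its index so that sorting keeps track of
    # which document a distance belongs to.
    distances = [(math.dist(query, vec), i) for i, vec in enumerate(vectors)]
    return [i for _, i in sorted(distances)[:k]]


# Toy 3-dimensional "document vectors"; the values are made up.
docs = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]
print(k_nearest([0.9, 0.1, 0.0], docs, k=2))
```

Because `TfidfVectorizer` L2-normalises its output by default, ranking such vectors by Euclidean distance gives the same order as ranking them by cosine similarity.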
Recall that the more documents of a given collection a term appears in, the lower its inverse document frequency (idf). To get a better understanding of this concept, your next task is to write code to find out which terms from the app descriptions have the lowest/highest idf.
Start by sorting the terms in _increasing_ order of idf, breaking ties by falling back on alphabetic order, and store the result in the variable `terms`.
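The sorting rule described above can be sketched on a toy collection. The example below is an illustration under stated assumptions rather than the expected solution: it computes idf by hand using the smoothed formula $\ln\frac{1+N}{1+\mathrm{df}(t)}+1$ (the `TfidfVectorizer` default), and the names `idf_sorted_terms` and `toy_terms` are hypothetical.

```python
import math


def idf_sorted_terms(documents):
    """Return (terms, idf): terms by increasing idf, ties broken alphabetically."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    df = {}
    for doc in documents:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    # Smoothed idf, assumed here to mirror the scikit-learn default.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    # Sorting by the pair (idf, term) implements the tie-breaking rule.
    terms = sorted(idf, key=lambda t: (idf[t], t))
    return terms, idf


# Toy tokenised documents; the contents are made up.
docs = [["game", "play", "fun"], ["game", "train"], ["game", "play"]]
toy_terms, toy_idf = idf_sorted_terms(docs)
print(toy_terms)
```

The term that occurs in every document (`game`) comes first, and the two terms that occur in only one document come last in alphabetic order.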
We often want to extract salient keywords from a document. A simple method is to pick the $n$ terms with the highest tf–idf value. Your last task in this lab is to implement this method. More specifically, we ask you to implement a function `keywords` that extracts keywords from a text.
The cell above prints the most salient keywords from the description of the app "Train Conductor World". The exact output may differ slightly depending on the strategy used to break ties, so the cell only checks whether there is sufficient overlap.
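The selection step itself, picking the terms with the highest tf–idf value, can be sketched independently of the fitted vectoriser. This is a minimal illustration, not the expected solution: `toy_keywords`, the idf table and the token list are hypothetical, term frequency is taken to be the raw count, and ties are broken alphabetically as one possible strategy.

```python
from collections import Counter


def toy_keywords(tokens, idf, n=3):
    """Return the n tokens with the highest tf-idf weight (tf = raw count)."""
    tf = Counter(tokens)
    # Terms without an idf entry get weight 0.0 in this sketch.
    weights = {t: count * idf.get(t, 0.0) for t, count in tf.items()}
    # Highest weight first; alphabetic order breaks ties.
    ranked = sorted(weights, key=lambda t: (-weights[t], t))
    return ranked[:n]


# Made-up idf values and tokens for illustration only.
toy_idf_table = {"train": 3.0, "conductor": 4.0, "game": 1.0, "world": 2.0}
tokens = ["train", "train", "game", "game", "game", "world", "conductor"]
print(toy_keywords(tokens, toy_idf_table, n=2))
```

The L2 normalisation that `TfidfVectorizer` applies by default rescales all weights of a document by the same factor, so it does not change this ranking within a single document.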