Commit 3063a292 authored by Umamaheswarababu Maddela
Upload New File
parent 0be5bd06
%% Cell type:markdown id:17cab251 tags:
<div class="alert alert-info">
➡️ Make sure that you have read the **[rules for hand-in assignments](https://www.ida.liu.se/~TDDE16/exam.en.shtml#handins)** and the **[policy on cheating and plagiarism](https://www.ida.liu.se/~TDDE16/exam.en.shtml#cheating)** before starting with this lab.
➡️ Make sure you fill in any cells (and _only_ those cells) that say **`YOUR CODE HERE`** or **`YOUR ANSWER HERE`**, and do _not_ modify any of the other cells.
➡️ **Before you submit your lab, make sure everything runs as expected.** For this, _restart the kernel_ and _run all cells_ from top to bottom. In Jupyter Notebook version 7 or higher, you can do this via "Run$\rightarrow$Restart Kernel and Run All Cells..." in the menu (or the "⏩" button in the toolbar).
</div>
%% Cell type:markdown id:91c4f84c-26c4-4bcb-81d1-8757088f0623 tags:
# L1: Information Retrieval
In this lab you will apply basic techniques from information retrieval to implement the core of a minimalistic search engine. The data for this lab consists of a collection of app descriptions scraped from the [Google Play Store](https://play.google.com/store/apps?hl=en). From this collection, your search engine should retrieve those apps whose descriptions best match a given query under the vector space model.
%% Cell type:code id:2c92aa93-cf15-4e1c-975e-fea9bbe0b0c4 tags:
``` python
# Define some helper functions that are used in this notebook
from IPython.display import display, HTML
def success():
    display(HTML('<div class="alert alert-success"><strong>Solution appears correct!</strong></div>'))
```
%% Cell type:markdown id:d2b5345b-0f8f-4a58-b7d3-bd5baae0c281 tags:
## Data set
%% Cell type:markdown id:68a82b5c-1660-4389-942b-f15420289549 tags:
The app descriptions come in the form of a compressed [JSON](https://en.wikipedia.org/wiki/JSON) file. Start by loading this file into a [Pandas](https://pandas.pydata.org) [DataFrame](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe).
%% Cell type:code id:fd4982e3-3df4-4837-97b5-4c300b0d4a20 tags:
``` python
import bz2
import numpy as np
import pandas as pd
with bz2.open('app-descriptions.json.bz2') as source:
    df = pd.read_json(source)
```
%% Cell type:markdown id:30823068-3102-430d-9989-b4f530756a08 tags:
In Pandas, a DataFrame is a table with indexed rows and labelled columns of potentially different types. You can access data in a DataFrame in various ways, including by row and column. To give an example, the code in the next cell shows rows 200–204:
%% Cell type:code id:9bb955a7-68d4-4e01-b0bd-16ae110af477 tags:
``` python
df[200:205]
```
%% Output
name \
200 Brick Breaker Star: Space King
201 Brick Classic - Brick Game
202 Bricks Breaker - Glow Balls
203 Bricks Breaker Quest
204 Brothers in Arms® 3
description
200 Introducing the best Brick Breaker game that e...
201 Classic Brick Game!\n\nBrick Classic is a popu...
202 Bricks Breaker - Glow Balls is a addictive and...
203 How to play\n- The ball flies to wherever you ...
204 Fight brave soldiers from around the globe on ...
%% Cell type:markdown id:27186bc1-de90-4125-837a-7b4e8671d276 tags:
As you can see, there are two labelled columns: `name` (the name of the app) and `description` (a textual description). The code in the next cell shows how to access fields from the description column.
%% Cell type:code id:595bb7af-5ee4-4b8e-8f5e-df5aaae688d9 tags:
``` python
df['description'][200:205]
```
%% Output
200 Introducing the best Brick Breaker game that e...
201 Classic Brick Game!\n\nBrick Classic is a popu...
202 Bricks Breaker - Glow Balls is a addictive and...
203 How to play\n- The ball flies to wherever you ...
204 Fight brave soldiers from around the globe on ...
Name: description, dtype: object
%% Cell type:markdown id:35e9d0cb-a8bb-4e94-9b8c-b6b33651ac8e tags:
## Problem 1: Preprocessing
%% Cell type:markdown id:d81a9865-f314-4d80-9fac-e6544413ee3a tags:
Your first task is to implement a preprocessor for your search engine. In the vector space model, *preprocessing* refers to any transformation applied to a text before vectorisation. Here you can restrict yourself to a simple type of preprocessing: tokenisation, stop word removal, and lemmatisation.
To implement your preprocessor, you can use [spaCy](https://spacy.io). Make sure to read the [Linguistic annotations](https://spacy.io/usage/spacy-101#annotations) section of the spaCy&nbsp;101; that section contains all the information you need for this problem (and more).
Implement your preprocessor by completing the skeleton code in the next cell, adding additional code as you deem necessary.
%% Cell type:code id:1a2a6fc6-dee8-4140-bf7a-1a245e60a1b3 tags:
``` python
import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner', 'textcat'])
def preprocess(text):
    """Preprocess the given text by tokenising it, removing any stop words,
    replacing each remaining token with its lemma (base form), and discarding
    all lemmas that contain non-alphabetical characters.

    Arguments:
        text (str): The text to preprocess.

    Returns:
        The list of remaining lemmas after preprocessing (represented as strings).
    """
    # Tokenise the text and remove stop words
    doc = nlp(text)
    tokens = [token for token in doc if not token.is_stop]
    # Lemmatise and keep only purely alphabetical lemmas
    lemmas = [token.lemma_ for token in tokens]
    return [lemma for lemma in lemmas if lemma.isalpha()]
```
%% Cell type:markdown id:8af08b57-8e0c-4526-b7fc-94c8bace1936 tags:
### 🤞 Test your code
Test your implementation by running the following cell:
%% Cell type:code id:e918affb-f8aa-4cbf-9cbe-b03b9251e941 tags:
``` python
"""Check that the preprocessing returns the correct output for a number of test cases."""
assert (
preprocess('Apple is looking at buying U.K. startup for $1 billion') ==
['Apple', 'look', 'buy', 'startup', 'billion']
)
assert (
preprocess('"Love Story" is a country pop song written and sung by Taylor Swift.') ==
['Love', 'Story', 'country', 'pop', 'song', 'write', 'sing', 'Taylor', 'Swift']
)
success()
```
%% Output
%% Cell type:markdown id:8e2852ff-b0ca-45b1-b904-530a1ae3494f tags:
## Problem 2: Vectorising
%% Cell type:markdown id:97a37f90-e7d8-4542-89e0-bea664601769 tags:
Your next task is to vectorise the data – and more specifically, to map each app description to a tf–idf vector. For this you can use the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class from [scikit-learn](https://scikit-learn.org/stable/). Make sure to specify your preprocessor from the previous problem as the `tokenizer` &ndash; not the `preprocessor`! &ndash; for the vectoriser. (In scikit-learn terminology, the `preprocessor` handles string-level preprocessing.)
After running the following cell:
- `vectorizer` should contain the vectorizer fitted on `df['description']`
- `X` should contain the vectorized `df['description']`
%% Cell type:code id:0ca674d8-c2df-4c8f-bb3d-6d59bdc401fb tags:
``` python
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(tokenizer=preprocess)
X = vectorizer.fit_transform(df['description'])
```
%% Output
C:\Users\Dell\miniconda3\envs\liu-text-mining\Lib\site-packages\sklearn\feature_extraction\text.py:525: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
warnings.warn(
%% Cell type:markdown id:12c88775-0daf-4c6a-8bc0-9a101bb1fa45 tags:
### 🤞 Test your code
Test your implementation by running the following cell:
%% Cell type:code id:edb03bab-edee-4b5d-98e3-c0fdb36ae507 tags:
``` python
"""Check that the dimensions of X are as expected."""
print(f"The dimensions of X are: {X.shape}")
assert X.shape[0] == 1614
assert 21200 < X.shape[1] < 21500
success()
```
%% Output
The dimensions of X are: (1614, 21356)
%% Cell type:markdown id:7f93ad72-b189-4ca4-80c0-c6f15eebe1e0 tags:
The dimensions of `X` should be around 1614$\times$21356; the number of rows should be _exactly_ 1614, while the number of columns may differ from the one given here depending on the versions of spaCy and of the language model used, as well as on the preprocessing.
%% Cell type:markdown id:a32e7b26-3b4f-4198-89f0-37bf54086827 tags:
## Problem 3: Retrieving
%% Cell type:markdown id:2413f786-8a38-4321-85d7-35e571f97aba tags:
To complete the search engine, your last task is to write a function that returns the most relevant app descriptions for a given query. An easy way to solve this task is to use scikit-learn&rsquo;s [NearestNeighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html) class. That class implements unsupervised nearest neighbours learning and allows you to easily find a predefined number of app descriptions whose vector representations are closest to the query vector.
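%% Cell type:markdown tags:
One detail worth noting: `NearestNeighbors` uses the Euclidean (Minkowski, $p=2$) metric by default, while this lab asks for cosine similarity. Because `TfidfVectorizer` L2-normalises its output rows by default (`norm='l2'`), the two rankings coincide; alternatively, you can pass `metric='cosine'` when instantiating the class. A minimal numeric check of this identity (the two vectors below are made up for illustration):
%% Cell type:code tags:
``` python
import numpy as np

# For unit-length vectors a and b: ||a - b||^2 = 2 - 2 * cos(a, b),
# so ranking neighbours by Euclidean distance equals ranking by cosine similarity.
a = np.array([1.0, 0.0])
b = np.array([0.6, 0.8])  # unit length, since 0.36 + 0.64 = 1
cos_sim = float(a @ b)
eucl_sq = float(np.sum((a - b) ** 2))
print(eucl_sq, 2 - 2 * cos_sim)  # both evaluate to 0.8
```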
First, instantiate and fit a class that returns the _ten (10)_ nearest neighbors:
%% Cell type:code id:80145071-fe5b-4366-87d0-99c39567a736 tags:
``` python
"""Instantiate and fit a class that returns the 10 nearest neighbors."""
# YOUR CODE HERE
from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=10)
neigh.fit(X)
```
%% Output
NearestNeighbors(n_neighbors=10)
%% Cell type:markdown id:5c79b405-a236-4caf-a140-dff01e0ee88b tags:
Second, implement a function that uses the fitted class to find the nearest neighbours for a given query string:
%% Cell type:code id:d4efd824-0122-40de-9fc6-efd071be906a tags:
``` python
def search(query):
    """Find the nearest neighbours in `df` for a query string.

    Arguments:
        query (str): A query string.

    Returns:
        The 10 apps (with name and description) most similar (in terms of
        cosine similarity) to the given query as a Pandas DataFrame.
    """
    # Vectorise the query using the fitted vectoriser
    query_vec = vectorizer.transform([query])
    # Find the indices of the 10 nearest neighbours of the query vector
    near_indices = neigh.kneighbors(query_vec, return_distance=False)
    # Retrieve the corresponding app names and descriptions
    return df.iloc[near_indices[0]]
```
%% Cell type:markdown id:e2cbb9f7-1cf5-49b9-a364-7b23fb6bbd1d tags:
### 🤞 Test your code
Test your implementation by running the following cell, which will show the 10 best search results for the query _"dodge trains"_:
%% Cell type:code id:3d30a52e-bdac-412b-b9c4-8e9cc17436a6 tags:
``` python
"""Check that searching for "dodge trains" returns a DataFrame with ten results,
and that the top result is "Subway Surfers"."""
result = search('dodge trains')
display(result)
assert isinstance(result, pd.DataFrame), "Search results should be a Pandas DataFrame"
assert len(result) == 10, "Should return 10 search results"
assert result.iloc[0]["name"] == "Subway Surfers", "Top search result should be 'Subway Surfers'"
success()
```
%% Output
%% Cell type:markdown id:2f1dcaf2-2b9b-4be8-8c25-40c13d8cc4ef tags:
The top hit in the list should be _Subway Surfers_.
%% Cell type:markdown id:af8229ea-3424-496a-84dc-76df5e41319c tags:
## Problem 4: Finding terms with low/high idf
%% Cell type:markdown id:0f689e35-9d58-4a5a-8719-3c8580ba1fd5 tags:
Recall that the inverse document frequency (idf) of a term decreases as the number of documents in a given collection that contain the term increases. To get a better understanding of this concept, your next task is to write code to find out which terms from the app descriptions have the lowest/highest idf.
Start by sorting the terms in _increasing_ order of idf, breaking ties by falling back on alphabetic order, and store the result in the variable `terms`.
%% Cell type:code id:5c9df0a4-2c15-4d2e-ad21-f121f39a4c72 tags:
``` python
# YOUR CODE HERE
# A list of (term, idf) pairs
term_idf_pairs = list(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))
# Sort the terms primarily in increasing order of idf, secondarily in alphabetical order
term_idf_pairs_sorted = sorted(term_idf_pairs, key=lambda x: (x[1], x[0]))
# Store the sorted terms in the variable `terms`
terms = [term for term, _ in term_idf_pairs_sorted]
```
%% Cell type:markdown id:a7a2b1dd-82fd-49ae-8771-c4ecd58bbe64 tags:
The following cell prints the 10 terms with the lowest/highest idf, which you can use to check if your results appear correct:
%% Cell type:code id:56f80f09-da92-4384-9989-42e650416d91 tags:
``` python
"""Print first 10/last 10 terms."""
print(f"Terms with the lowest idf:\n{terms[:10]}\n")
print(f"Terms with the highest idf:\n{terms[-10:]}")
```
%% Output
Terms with the lowest idf:
['game', 'play', 'feature', 'free', 'new', 'world', 'time', 'app', 'fun', 'use']
Terms with the highest idf:
['회원가입에', '회원을', '획득한', '효과', '효과음', 'find', 'finger', 'finish', 'first', 'flye']
%% Cell type:markdown id:a099ea52-599e-4c01-99d9-e7388d2c5be8 tags:
## Problem 5: Keyword extraction
%% Cell type:markdown id:d2aacd2e-3de9-49fe-aab2-2e8b5577ac1b tags:
We often want to extract salient keywords from a document. A simple method is to pick the $k$ terms with the highest tf–idf value. Your last task in this lab is to implement this method. More specifically, we ask you to implement a function `keywords` that extracts keywords from a text.
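%% Cell type:markdown tags:
The "pick the $k$ largest" step itself is independent of the lab data; assuming a toy score vector and vocabulary (the names below are illustrative only), it can be sketched with NumPy's `argsort`:
%% Cell type:code tags:
``` python
import numpy as np

# Toy tf-idf row and matching vocabulary (illustrative names)
scores = np.array([0.1, 0.7, 0.0, 0.4])
vocab = np.array(['alpha', 'beta', 'gamma', 'delta'])

# argsort sorts ascending; reverse it and take the first k indices
k = 2
top = list(vocab[np.argsort(scores)[::-1][:k]])
print(top)  # ['beta', 'delta']
```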
%% Cell type:code id:521a307f-eeab-48ec-b139-d5111e2b2ba8 tags:
``` python
def keywords(text, n=10):
    """Extract the most salient keywords from a text.

    Arguments:
        text (str): The text from which to extract keywords.
        n (int): The number of keywords to extract. [default: 10]

    Returns:
        A list containing the `n` most salient keywords from `text`, as measured by
        their tf–idf value relative to the collection of app descriptions.
    """
    # Vectorise the text (tf-idf) using the fitted vectoriser
    text_vec = vectorizer.transform([text])
    # All terms in the vectoriser's vocabulary
    all_terms = vectorizer.get_feature_names_out()
    # Indices of the terms, sorted by decreasing tf-idf value
    sort_indices = text_vec.toarray()[0].argsort()[::-1]
    # Return the n terms with the highest tf-idf values
    return [all_terms[i] for i in sort_indices[:n]]
```
%% Cell type:markdown id:07a4a6b3-78dd-4ece-9c56-fa73c6319ec0 tags:
### 🤞 Test your code
Test your implementation by running the following cell:
%% Cell type:code id:eb0344ed-f785-41c0-b74b-80208f148ab8 tags:
``` python
"""Check that the most salient keywords from the description of 'Train Conductor World'
overlap substantially with the expected list of keywords."""
out = keywords(df['description'][1428])
print(out)
assert len(out) == 10
assert len(
set(out) & set(['train', 'railway', 'railroad', 'rail', 'chaos', 'crash', 'timetable', 'overcast', 'haul', 'tram'])
) >= 6, "Keywords for df['description'][1428] do not overlap substantially with the expected result"
success()
```
%% Output
['train', 'railway', 'railroad', 'rail', 'chaos', 'crash', 'locomotive', 'overcast', 'timetable', 'tram']
%% Cell type:markdown id:e6ff1084-839a-4fbb-9f97-f1db934647bc tags:
The cell above prints the most salient keywords from the description of the app "Train Conductor World". The exact output may differ slightly depending on the strategy used to break ties, so the cell only checks if there is a sufficient overlap.
%% Cell type:markdown id:125ccdbd-4375-4d2f-8b1d-f47097ef2e84 tags:
**Congratulations on finishing this lab! 👍**
<div class="alert alert-info">
➡️ Don't forget to **test that everything runs as expected** before you submit!
</div>