➡️ Before you start, make sure that you are familiar with the **[study guide](https://liu-nlp.ai/text-mining/logistics/)**, in particular the rules around **cheating and plagiarism** (found in the course memo).
➡️ If you use code from external sources (e.g. StackOverflow, ChatGPT, ...) as part of your solutions, don't forget to add a reference to the source(s) (for example as a comment above your code).
➡️ Make sure you fill in all cells that say **`YOUR CODE HERE`** or **YOUR ANSWER HERE**. You normally shouldn't need to modify any of the other cells.
</div>
%% Cell type:markdown id: tags:
# L4: Clustering and Topic Modelling
%% Cell type:markdown id: tags:
Text clustering groups documents in such a way that documents within a cluster are more ‘similar’ to other documents in the cluster than to documents outside it. The exact definition of what ‘similar’ means in this context varies across applications and clustering algorithms.
In this lab you will experiment with both hard and soft clustering techniques. More specifically, in the first part you will be using the $k$-means algorithm, and in the second part you will be using a topic model based on Latent Dirichlet Allocation (LDA).
%% Cell type:code id: tags:
``` python
# Define some helper functions that are used in this notebook
%matplotlib inline
from IPython.display import display, HTML

def success():
    display(HTML('<div class="alert alert-success"><strong>Checks have passed!</strong></div>'))
```
%% Cell type:markdown id: tags:
## Dataset 1: Hard clustering
%% Cell type:markdown id: tags:
The raw data for the hard clustering part of this lab is a collection of product reviews. We have preprocessed the data by tokenization and lowercasing.
%% Cell type:code id: tags:
``` python
import pandas as pd
import bz2

with bz2.open('reviews.json.bz2') as source:
    df = pd.read_json(source)
```
%% Cell type:markdown id: tags:
When you inspect the data frame, you can see that there are three labelled columns: `category` (the product category), `sentiment` (whether the product review was classified as ‘positive’ or ‘negative’ towards the product), and `text` (the space-separated text of the review).
%% Cell type:code id: tags:
``` python
pd.set_option('display.max_colwidth', None)
df.head()
```
%% Output
category sentiment \
0 music neg
1 music neg
2 books neg
3 books pos
4 dvd pos
text
0 i bought this album because i loved the title song . it 's such a great song , how bad can the rest of the album be , right ? well , the rest of the songs are just filler and are n't worth the money i paid for this . it 's either shameless bubblegum or oversentimentalized depressing tripe . kenny chesney is a popular artist and as a result he is in the cookie cutter category of the nashville music scene . he 's gotta pump out the albums so the record company can keep lining their pockets while the suckers out there keep buying this garbage to perpetuate more garbage coming out of that town . i 'll get down off my soapbox now . but country music really needs to get back to it 's roots and stop this pop nonsense . what country music really is and what it is considered to be by mainstream are two different things .
1 i was misled and thought i was buying the entire cd and it contains one song
2 i have introduced many of my ell , high school students to lois lowery and the depth of her characters . she is a brilliant writer and capable of inspiring fierce passion in her readers as they encounter shocking details of her utopian worlds . i was anxious to read this companion novel and had planned to share it with my class this january . although the series is written for 6th graders and older , this book 's simplicity , in its message , language and writing style will inspire no one . i am sadly disappointed
3 anything you purchase in the left behind series is an excellent read . these books are great and very close to the bible . i have the entire set . amazon is a great shopping site and they ship fast . i would recommend these to any christian wanting to know about what to expect during the return of christ ! they are fiction but still makes a good point
4 i loved these movies , and i cant wiat for the third one ! very funny , not suitable for chilren
%% Cell type:markdown id: tags:
## Problem 1: K-means clustering
%% Cell type:markdown id: tags:
Your first task is to cluster the product review data using a tf–idf vectorizer and $k$-means clustering.
%% Cell type:markdown id: tags:
### Task 1.1
Start by **performing tf–idf vectorization**. In connection with vectorization, you should also **filter out standard English stop words**. While you could use [spaCy](https://spacy.io/) for this task, here it suffices to use the word list implemented in [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).
After running the vectorization cell below:
- `vectorizer` should contain the vectorizer fitted on `df['text']`
- `reviews` should contain the vectorized `df['text']`
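%% Cell type:markdown id: tags:
A minimal sketch of this step, assuming scikit-learn's built-in English stop word list (any equivalent setup works):
%% Cell type:code id: tags:solution
``` python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit a tf–idf vectorizer on the review texts, filtering English stop words
vectorizer = TfidfVectorizer(stop_words='english')
reviews = vectorizer.fit_transform(df['text'])
```
%% Cell type:markdown id: tags: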
Next, **write a function to cluster the vectorized data.** For this, you can use scikit-learn’s [KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) class, which has several parameters that you can tweak, the most important one being the _number of clusters_. Your function should therefore take the number of clusters as an argument; you can leave all other parameters at their defaults.
%% Cell type:code id: tags:solution
``` python
from sklearn.cluster import KMeans

def fit_kmeans(data, n_clusters):
    """Fit a k-means classifier to some data.

    Arguments:
        data: The vectorized data to train the classifier on.
        n_clusters (int): The number of clusters.

    Returns:
        The trained k-means classifier.
    """
    kmeans = KMeans(n_clusters=n_clusters).fit(data)
    return kmeans
```
%% Cell type:markdown id: tags:
To sanity-check your clustering, **create a bar plot** with the number of documents per cluster:
%% Cell type:code id: tags:solution
``` python
import matplotlib.pyplot as plt
import numpy as np

def plot_cluster_size(kmeans):
    """Produce & display a bar plot with the number of documents per cluster.

    Arguments:
        kmeans: The trained k-means classifier.
    """
    # Count how many documents were assigned to each cluster label
    clusters, sizes = np.unique(kmeans.labels_, return_counts=True)
    plt.bar(clusters, sizes)
    plt.xlabel('Cluster')
    plt.ylabel('Number of documents')
    plt.show()
```
%% Cell type:markdown id: tags:
The following cell shows how your code should run. The output of the cell should be the bar plot of the cluster sizes. Note that sizes may vary considerably between clusters and among different random seeds, so there is no single “correct” output here! Re-run the cell a couple of times to observe how the plot changes.
%% Cell type:code id: tags:
``` python
kmeans = fit_kmeans(reviews, 3)
plot_cluster_size(kmeans)
```
%% Output
%% Cell type:markdown id: tags:
## Problem 2: Summarising clusters
%% Cell type:markdown id: tags:
Once you have a clustering, you can try to see whether it is meaningful. One useful technique in that context is to **generate a “summary”** for each cluster by extracting the $n$ highest-weighted terms from the centroid of each cluster. Your next task is to implement this approach.
Once you have computed the cluster summaries, take a minute to reflect on their quality. Is it clear what the reviews in a given cluster are about? Do the cluster summaries contain any unexpected terms?
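%% Cell type:markdown id: tags:
One possible sketch of this approach, assuming the `vectorizer` and `kmeans` objects from Problem 1 (the helper name `summarise_clusters` is our own):
%% Cell type:code id: tags:solution
``` python
import numpy as np

def summarise_clusters(kmeans, vectorizer, n=10):
    """Print the n highest-weighted terms from each cluster centroid."""
    terms = vectorizer.get_feature_names_out()
    for i, centroid in enumerate(kmeans.cluster_centers_):
        # Indices of the n highest-weighted terms, in descending order
        top = np.argsort(centroid)[::-1][:n]
        print(f"Cluster {i}: {' '.join(terms[j] for j in top)}")

summarise_clusters(kmeans, vectorizer)
```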
%% Cell type:markdown id: tags:
## Problem 3: Evaluate clustering performance
%% Cell type:markdown id: tags:
In some scenarios, you may have gold-standard class labels available for at least a subset of your documents. In our case, we could use the gold-standard categories (from the `category` column) as class labels. This means we’re making the assumption that a “good” clustering should put texts into the same cluster _if and only if_ they belong to the same category.
If we have such class labels, we can compute a variety of performance measures to see how well our $k$-means clustering resembles the given class labels. Here, we will consider three of these measures: the **Rand index (RI)**; the **adjusted Rand index (ARI)**, which has been corrected for chance; and the **V-measure**. For all of them (and more), we can make use of [implementations by scikit-learn](https://scikit-learn.org/1.5/modules/clustering.html#clustering-performance-evaluation).
%% Cell type:markdown id: tags:
Your task is to **compare the performance** of different $k$-means clusterings with $k = 1, \ldots, 10$ clusters. As your evaluation data, use the _first 1000 documents_ from the original data set along with their gold-standard categories (from the `category` column).
**Visualise your results as a line plot**, where
- the $x$-axis corresponds to $k$
- the $y$-axis corresponds to the score of the evaluation measure
- each evaluation measure (RI, ARI, V) is shown by a differently-colored and/or -styled line in the plot
Remember that you may get different clusters each time you run the $k$-means algorithm, so re-run your solution above a few times to see how the results change. Take a moment to think about how you would interpret these results; you will need this for the reflection.
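%% Cell type:markdown id: tags:
A minimal sketch of this evaluation, assuming the vectorized `reviews` and the `fit_kmeans` function from Problem 1 (line styling and other details are up to you):
%% Cell type:code id: tags:solution
``` python
from sklearn.metrics import rand_score, adjusted_rand_score, v_measure_score

# Evaluate on the first 1000 documents and their gold-standard categories
gold = df['category'][:1000]
data = reviews[:1000]

ks = range(1, 11)
scores = {'RI': [], 'ARI': [], 'V': []}
for k in ks:
    pred = fit_kmeans(data, k).labels_
    scores['RI'].append(rand_score(gold, pred))
    scores['ARI'].append(adjusted_rand_score(gold, pred))
    scores['V'].append(v_measure_score(gold, pred))

for name, values in scores.items():
    plt.plot(ks, values, label=name)
plt.xlabel('$k$ (number of clusters)')
plt.ylabel('score')
plt.legend()
plt.show()
```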
%% Cell type:markdown id: tags:
## Dataset 2: Topic modelling
%% Cell type:markdown id: tags:
The data set for the topic modelling part of this lab is the collection of all [State of the Union](https://en.wikipedia.org/wiki/State_of_the_Union) addresses from the years 1975–2000. These speeches come as a single text file with one sentence per line. The following code cell prints the first 5 lines from the data file:
%% Cell type:code id: tags:
``` python
from itertools import islice

with open('sotu_1975_2000.txt') as source:
    # Print the first 5 lines only
    for line in islice(source, 5):
        print(line.rstrip())
```
%% Output
mr speaker mr vice president members of the 94th congress and distinguished guests
twenty six years ago a freshman congressman a young fellow with lots of idealism who was out to change the world stood before sam rayburn in the well of the house and solemnly swore to the same oath that all of you took yesterday an unforgettable experience and i congratulate you all
two days later that same freshman stood at the back of this great chamber over there someplace as president truman all charged up by his single handed election victory reported as the constitution requires on the state of the union
when the bipartisan applause stopped president truman said i am happy to report to this 81st congress that the state of the union is good our nation is better able than ever before to meet the needs of the american people and to give them their fair chance in the pursuit of happiness it is foremost among the nations of the world in the search for peace
today that freshman member from michigan stands where mr truman stood and i must say to you that the state of the union is not good
%% Cell type:markdown id: tags:
Take a few minutes to think about what topics you would expect in this data set.
%% Cell type:markdown id: tags:
## Problem 4: Train a topic model
In this problem, we will train an LDA model on the State of the Union (SOTU) dataset. For this, we will be using [spaCy](https://spacy.io/) and the [gensim](https://radimrehurek.com/gensim/) topic modelling library.
%% Cell type:markdown id: tags:
### Task 4.1: Preparing the data
Start by **preprocessing the data** using spaCy as follows:
- Filter out stop words, non-alphabetic tokens, and tokens fewer than 3 characters in length.
- Store the documents as a nested list where the first level of nesting corresponds to the sentences and the second level corresponds to the tokens in each sentence.
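%% Cell type:markdown id: tags:
A sketch of one possible `load_and_preprocess_documents` implementation, assuming spaCy's `en_core_web_sm` model is installed (which pipeline components you disable is up to you):
%% Cell type:code id: tags:solution
``` python
import spacy

def load_and_preprocess_documents():
    """Load the SOTU data and return a list of token lists, one per sentence."""
    # The parser and NER components are not needed for token-level filtering
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
    documents = []
    with open('sotu_1975_2000.txt') as source:
        for doc in nlp.pipe(line.rstrip() for line in source):
            # Keep alphabetic, non-stop-word tokens of at least 3 characters
            documents.append([t.text for t in doc
                              if t.is_alpha and not t.is_stop and len(t) >= 3])
    return documents
```
%% Cell type:markdown id: tags: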
Test your preprocessing by running the following cell. It will output the tokens (after preprocessing) for an example document and compare them against the expected output.
%% Cell type:code id: tags:
``` python
documents = load_and_preprocess_documents()
print(f"Document 42 after preprocessing: {' '.join(documents[42])}")
assert ' '.join(documents[42]) == "reduce oil imports million barrels day end year million barrels day end"
success()
```
%% Output
Document 42 after preprocessing: reduce oil imports million barrels day end year million barrels day end
%% Cell type:markdown id: tags:
### Task 4.2: Training LDA
Now that we have the list of documents, we can use gensim to train an LDA model on them. Gensim works a bit differently from scikit-learn and has its own interfaces, so you should skim the section [“Pre-process and vectorize the documents”](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html#pre-process-and-vectorize-the-documents) of the documentation to learn how to create the dictionary and the vectorized corpus representation required by gensim.
Based on this, **write code to train an [LdaModel](https://radimrehurek.com/gensim/models/ldamodel.html)** for $k=10$ topics, using default values for all other parameters.
Inspect the topics. Can you ‘label’ each topic with a short description of what it is about? Do the topics match your expectations?
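%% Cell type:markdown id: tags:
A sketch of a possible `train_lda_model` helper matching the calls used later in this notebook (`passes` is gensim's name for the number of training passes):
%% Cell type:code id: tags:solution
``` python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def train_lda_model(documents, n_topics, passes=1):
    """Train an LDA model with n_topics topics on the given token lists."""
    # Map each token to an integer id and build the bag-of-words corpus
    dictionary = Dictionary(documents)
    corpus = [dictionary.doc2bow(doc) for doc in documents]
    return LdaModel(corpus=corpus, id2word=dictionary,
                    num_topics=n_topics, passes=passes)

model = train_lda_model(documents, 10)
model.print_topics()
```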
%% Cell type:markdown id: tags:
## Problem 5: Monitor a topic model for convergence
%% Cell type:markdown id: tags:
When learning an LDA model, it is important to make sure that the training algorithm has converged to a stable posterior distribution. One way to do so is to plot, after each training epoch (or ‘pass’, in gensim parlance), the log likelihood of the training data under the posterior. Your last task in this lab is to create such a plot and, based on this, to suggest an appropriate number of epochs.
To collect information about the posterior likelihood after each pass, we need to enable the logging facilities of gensim. Once this is done, gensim will add various diagnostics to a log file `gensim.log`.
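%% Cell type:markdown id: tags:
A sketch of the logging setup and the `clear_logfile()`/`parse_logfile()` helpers used below. The regular expression assumes the “per-word bound” lines that gensim's LdaModel writes to the log; adjust the pattern if your gensim version logs differently:
%% Cell type:code id: tags:
``` python
import logging
import re

# Send gensim's diagnostic messages to gensim.log
logging.basicConfig(filename='gensim.log',
                    format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

def clear_logfile():
    """Empty the log file before training a new model."""
    open('gensim.log', 'w').close()

def parse_logfile():
    """Return the per-word log-likelihood bounds logged during training."""
    with open('gensim.log') as logfile:
        return [float(m.group(1)) for m in
                re.finditer(r'(-?\d+\.\d+) per-word bound', logfile.read())]
```
%% Cell type:markdown id: tags: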
Here's an example of how to run it — note that we call `clear_logfile()` to empty the logfile before training the model. If your code from Problem 4 was correct, the result should be a list with a single log-likelihood score, since we are doing a single training pass:
%% Cell type:code id: tags:
``` python
clear_logfile()
model = train_lda_model(documents, 10, passes=1)
likelihoods = parse_logfile()
print(likelihoods)
```
%% Output
[-6.896]
%% Cell type:markdown id: tags:
### Task 5.1: Plotting log-likelihoods
Your task now is to **re-train your LDA model for 50 passes**, retrieve the list of log likelihoods, and **create a plot** from this data.
%% Cell type:code id: tags:
``` python
clear_logfile()
model = train_lda_model(documents, 10, passes=50)
likelihoods = parse_logfile()
plt.plot(likelihoods)
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
### Task 5.2: Interpreting log-likelihoods
How do you interpret the plot you produced in Task 5.1? Based on the plot, what would be a reasonable choice for the number of passes? **Retrain your LDA model with that number** and re-inspect the topics it finds.
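%% Cell type:markdown id: tags:
For instance, if your curve flattens out early, you might retrain as follows (the value 30 is purely hypothetical; read a suitable number of passes off your own plot):
%% Cell type:code id: tags:
``` python
# passes=30 is a hypothetical choice based on where the curve flattens
model = train_lda_model(documents, 10, passes=30)
model.print_topics()
```
%% Cell type:markdown id: tags: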
<strong>After you have solved the lab,</strong> write a <em>brief</em> reflection (max. one A4 page) on the question(s) below. Remember:
<ul>
<li>You are encouraged to discuss this part with your lab partner, but you should each write up your reflection <strong>individually</strong>.</li>
<li><strong>Do not put your answers in the notebook</strong>; upload them in the separate submission opportunity for the reflections on Lisam.</li>
</ul>
</div>
%% Cell type:markdown id: tags:
1. In Problem 3, you performed an evaluation of $k$-means clustering with different values for $k$. How do you interpret the results? What would you expect to be a “good” number of clusters for this dataset? What do the evaluation measures suggest would be a “good” number of clusters?
2. How did you choose the number of LDA passes in Task 5.2? Do you consider the topic clusters you got in Task 5.2 to be “better” than the ones from Task 4.2? Base your reasoning on one or more concrete examples from the LDA output.
➡️ Before you submit, **make sure the notebook can be run from start to finish** without errors. For this, _restart the kernel_ and _run all cells_ from top to bottom. In Jupyter Notebook version 7 or higher, you can do this via "Run$\rightarrow$Restart Kernel and Run All Cells..." in the menu (or the "⏩" button in the toolbar).