➡️ Before you start, make sure that you are familiar with the **[study guide](https://liu-nlp.ai/text-mining/logistics/)**, in particular the rules around **cheating and plagiarism** (found in the course memo).
➡️ If you use code from external sources (e.g. StackOverflow, ChatGPT, ...) as part of your solutions, don't forget to add a reference to the source(s) (for example as a comment above your code).
➡️ Make sure you fill in all cells that say **`YOUR CODE HERE`** or **YOUR ANSWER HERE**. You normally shouldn't need to modify any of the other cells.
</div>
%% Cell type:markdown id: tags:
# L4: Clustering and Topic Modelling
%% Cell type:markdown id: tags:
Text clustering groups documents in such a way that documents within a cluster are more ‘similar’ to other documents in the cluster than to documents outside it. The exact definition of what ‘similar’ means in this context varies across applications and clustering algorithms.
In this lab you will experiment with both hard and soft clustering techniques. More specifically, in the first part you will be using the $k$-means algorithm, and in the second part you will be using a topic model based on Latent Dirichlet Allocation (LDA).
%% Cell type:code id: tags:
``` python
# Define some helper functions that are used in this notebook
%matplotlib inline
from IPython.display import display, HTML

def success():
    display(HTML('<div class="alert alert-success"><strong>Checks have passed!</strong></div>'))
```
%% Cell type:markdown id: tags:
## Dataset 1: Hard clustering
%% Cell type:markdown id: tags:
The raw data for the hard clustering part of this lab is a collection of product reviews. We have preprocessed the data by tokenization and lowercasing.
%% Cell type:code id: tags:
``` python
import pandas as pd
import bz2

with bz2.open('reviews.json.bz2') as source:
    df = pd.read_json(source)
```
%% Cell type:markdown id: tags:
When you inspect the data frame, you can see that there are three labelled columns: `category` (the product category), `sentiment` (whether the product review was classified as ‘positive’ or ‘negative’ towards the product), and `text` (the space-separated text of the review).
%% Cell type:code id: tags:
``` python
pd.set_option('display.max_colwidth', None)
df.head()
```
%% Output
category sentiment \
0 music neg
1 music neg
2 books neg
3 books pos
4 dvd pos
text
0 i bought this album because i loved the title song . it 's such a great song , how bad can the rest of the album be , right ? well , the rest of the songs are just filler and are n't worth the money i paid for this . it 's either shameless bubblegum or oversentimentalized depressing tripe . kenny chesney is a popular artist and as a result he is in the cookie cutter category of the nashville music scene . he 's gotta pump out the albums so the record company can keep lining their pockets while the suckers out there keep buying this garbage to perpetuate more garbage coming out of that town . i 'll get down off my soapbox now . but country music really needs to get back to it 's roots and stop this pop nonsense . what country music really is and what it is considered to be by mainstream are two different things .
1 i was misled and thought i was buying the entire cd and it contains one song
2 i have introduced many of my ell , high school students to lois lowery and the depth of her characters . she is a brilliant writer and capable of inspiring fierce passion in her readers as they encounter shocking details of her utopian worlds . i was anxious to read this companion novel and had planned to share it with my class this january . although the series is written for 6th graders and older , this book 's simplicity , in its message , language and writing style will inspire no one . i am sadly disappointed
3 anything you purchase in the left behind series is an excellent read . these books are great and very close to the bible . i have the entire set . amazon is a great shopping site and they ship fast . i would recommend these to any christian wanting to know about what to expect during the return of christ ! they are fiction but still makes a good point
4 i loved these movies , and i cant wiat for the third one ! very funny , not suitable for chilren
%% Cell type:markdown id: tags:
## Problem 1: K-means clustering
%% Cell type:markdown id: tags:
Your first task is to cluster the product review data using a tf–idf vectorizer and $k$-means clustering.
%% Cell type:markdown id: tags:
### Task 1.1
Start by **performing tf–idf vectorization**. In connection with vectorization, you should also **filter out standard English stop words**. While you could use [spaCy](https://spacy.io/) for this task, here it suffices to use the word list implemented in [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).
After running the vectorization cell below:
- `vectorizer` should contain the vectorizer fitted on `df['text']`
- `reviews` should contain the vectorized `df['text']`
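%% Cell type:markdown id: tags:
A minimal sketch of this step, assuming scikit-learn's built-in English stop word list (any equivalent setup works):
%% Cell type:code id: tags:solution
``` python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit a tf–idf vectorizer on the review texts, filtering English stop words
vectorizer = TfidfVectorizer(stop_words='english')
reviews = vectorizer.fit_transform(df['text'])
```
%% Cell type:markdown id: tags: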
Next, **write a function to cluster the vectorized data.** For this, you can use scikit-learn’s [KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) class, which has several parameters that you can tweak, the most important one being the _number of clusters_. Your function should therefore take the number of clusters as an argument; you can leave all other parameters at their defaults.
%% Cell type:code id: tags:solution
``` python
from sklearn.cluster import KMeans

def fit_kmeans(data, n_clusters):
    """Fit a k-means classifier to some data.

    Arguments:
        data: The vectorized data to train the classifier on.
        n_clusters (int): The number of clusters.

    Returns:
        The trained k-means classifier.
    """
    kmeans = KMeans(n_clusters=n_clusters).fit(data)
    return kmeans
```
%% Cell type:markdown id: tags:
To sanity-check your clustering, **create a bar plot** with the number of documents per cluster:
%% Cell type:code id: tags:solution
``` python
import matplotlib.pyplot as plt
import numpy as np

def plot_cluster_size(kmeans):
    """Produce & display a bar plot with the number of documents per cluster.

    Arguments:
        kmeans: The trained k-means classifier.
    """
    # Count how many documents were assigned to each cluster label
    clusters, sizes = np.unique(kmeans.labels_, return_counts=True)
    plt.bar(clusters, sizes)
    plt.xlabel('Cluster')
    plt.ylabel('Number of documents')
    plt.show()
```
%% Cell type:markdown id: tags:
The following cell shows how your code should run. The output of the cell should be the bar plot of the cluster sizes. Note that sizes may vary considerably between clusters and among different random seeds, so there is no single “correct” output here! Re-run the cell a couple of times to observe how the plot changes.
%% Cell type:code id: tags:
``` python
kmeans = fit_kmeans(reviews, 3)
plot_cluster_size(kmeans)
```
%% Output
%% Cell type:markdown id: tags:
## Problem 2: Summarising clusters
%% Cell type:markdown id: tags:
Once you have a clustering, you can try to see whether it is meaningful. One useful technique in that context is to **generate a “summary”** for each cluster by extracting the $n$ highest-weighted terms from the centroid of each cluster. Your next task is to implement this approach.
Once you have computed the cluster summaries, take a minute to reflect on their quality. Is it clear what the reviews in a given cluster are about? Do the cluster summaries contain any unexpected terms?
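%% Cell type:markdown id: tags:
One possible sketch of this approach, assuming the `vectorizer` and `kmeans` objects from Problem 1 (the helper name `summarise_clusters` is our own):
%% Cell type:code id: tags:solution
``` python
import numpy as np

def summarise_clusters(kmeans, vectorizer, n=10):
    """Print the n highest-weighted terms from each cluster centroid."""
    terms = vectorizer.get_feature_names_out()
    for i, centroid in enumerate(kmeans.cluster_centers_):
        # Indices of the n highest-weighted terms, in descending order
        top = np.argsort(centroid)[::-1][:n]
        print(f"Cluster {i}: {' '.join(terms[j] for j in top)}")

summarise_clusters(kmeans, vectorizer)
```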
%% Cell type:markdown id: tags:
## Problem 3: Evaluate clustering performance
%% Cell type:markdown id: tags:
In some scenarios, you may have gold-standard class labels available for at least a subset of your documents. In our case, we could use the gold-standard categories (from the `category` column) as class labels. This means we’re making the assumption that a “good” clustering should put texts into the same cluster _if and only if_ they belong to the same category.
If we have such class labels, we can compute a variety of performance measures to see how well our $k$-means clustering resembles the given class labels. Here, we will consider three of these measures: the **Rand index (RI)**; the **adjusted Rand index (ARI)**, which has been corrected for chance; and the **V-measure**. For all of them (and more), we can make use of [implementations by scikit-learn](https://scikit-learn.org/1.5/modules/clustering.html#clustering-performance-evaluation).
%% Cell type:markdown id: tags:
Your task is to **compare the performance** of different $k$-means clusterings with $k = 1, \ldots, 10$ clusters. As your evaluation data, use the _first 1000 documents_ from the original data set along with their gold-standard categories (from the `category` column).
**Visualise your results as a line plot**, where
- the $x$-axis corresponds to $k$
- the $y$-axis corresponds to the score of the evaluation measure
- each evaluation measure (RI, ARI, V) is shown by a differently-colored and/or -styled line in the plot
Remember that you may get different clusters each time you run the $k$-means algorithm, so re-run your solution above a few times to see how the results change. Take a moment to think about how you would interpret these results; you will need this for the reflection.
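%% Cell type:markdown id: tags:
A minimal sketch of this evaluation, assuming the vectorized `reviews` and the `fit_kmeans` function from Problem 1 (line styling and other details are up to you):
%% Cell type:code id: tags:solution
``` python
from sklearn.metrics import rand_score, adjusted_rand_score, v_measure_score

# Evaluate on the first 1000 documents and their gold-standard categories
gold = df['category'][:1000]
data = reviews[:1000]

ks = range(1, 11)
scores = {'RI': [], 'ARI': [], 'V': []}
for k in ks:
    pred = fit_kmeans(data, k).labels_
    scores['RI'].append(rand_score(gold, pred))
    scores['ARI'].append(adjusted_rand_score(gold, pred))
    scores['V'].append(v_measure_score(gold, pred))

for name, values in scores.items():
    plt.plot(ks, values, label=name)
plt.xlabel('$k$ (number of clusters)')
plt.ylabel('score')
plt.legend()
plt.show()
```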
%% Cell type:markdown id: tags:
## Dataset 2: Topic modelling
%% Cell type:markdown id: tags:
The data set for the topic modelling part of this lab is the collection of all [State of the Union](https://en.wikipedia.org/wiki/State_of_the_Union) addresses from the years 1975–2000. These speeches come as a single text file with one sentence per line. The following code cell prints the first 5 lines from the data file:
%% Cell type:code id: tags:
``` python
from itertools import islice

with open('sotu_1975_2000.txt') as source:
    # Print the first 5 lines only
    for line in islice(source, 5):
        print(line.rstrip())
```
%% Output
mr speaker mr vice president members of the 94th congress and distinguished guests
twenty six years ago a freshman congressman a young fellow with lots of idealism who was out to change the world stood before sam rayburn in the well of the house and solemnly swore to the same oath that all of you took yesterday an unforgettable experience and i congratulate you all
two days later that same freshman stood at the back of this great chamber over there someplace as president truman all charged up by his single handed election victory reported as the constitution requires on the state of the union
when the bipartisan applause stopped president truman said i am happy to report to this 81st congress that the state of the union is good our nation is better able than ever before to meet the needs of the american people and to give them their fair chance in the pursuit of happiness it is foremost among the nations of the world in the search for peace
today that freshman member from michigan stands where mr truman stood and i must say to you that the state of the union is not good
%% Cell type:markdown id: tags:
Take a few minutes to think about what topics you would expect in this data set.
%% Cell type:markdown id: tags:
## Problem 4: Train a topic model
In this problem, we will train an LDA model on the State of the Union (SOTU) dataset. For this, we will be using [spaCy](https://spacy.io/) and the [gensim](https://radimrehurek.com/gensim/) topic modelling library.
%% Cell type:markdown id: tags:
### Task 4.1: Preparing the data
Start by **preprocessing the data** using spaCy as follows:
- Filter out stop words, non-alphabetic tokens, and tokens fewer than 3 characters in length.
- Store the documents as a nested list where the first level of nesting corresponds to the sentences and the second level corresponds to the tokens in each sentence.
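%% Cell type:markdown id: tags:
A sketch of one possible `load_and_preprocess_documents` implementation, assuming spaCy's `en_core_web_sm` model is installed (which pipeline components you disable is up to you):
%% Cell type:code id: tags:solution
``` python
import spacy

def load_and_preprocess_documents():
    """Load the SOTU data and return a list of token lists, one per sentence."""
    # The parser and NER components are not needed for token-level filtering
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
    documents = []
    with open('sotu_1975_2000.txt') as source:
        for doc in nlp.pipe(line.rstrip() for line in source):
            # Keep alphabetic, non-stop-word tokens of at least 3 characters
            documents.append([t.text for t in doc
                              if t.is_alpha and not t.is_stop and len(t) >= 3])
    return documents
```
%% Cell type:markdown id: tags: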
Test your preprocessing by running the following cell. It will output the tokens (after preprocessing) for an example document and compare them against the expected output.
%% Cell type:code id: tags:
``` python
documents = load_and_preprocess_documents()
print(f"Document 42 after preprocessing: {' '.join(documents[42])}")
assert ' '.join(documents[42]) == "reduce oil imports million barrels day end year million barrels day end"
success()
```
%% Output
Document 42 after preprocessing: reduce oil imports million barrels day end year million barrels day end
%% Cell type:markdown id: tags:
### Task 4.2: Training LDA
Now that we have the list of documents, we can use gensim to train an LDA model on them. Gensim works a bit differently from scikit-learn and has its own interfaces, so you should skim the section [“Pre-process and vectorize the documents”](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html#pre-process-and-vectorize-the-documents) of the documentation to learn how to create the dictionary and the vectorized corpus representation required by gensim.
Based on this, **write code to train an [LdaModel](https://radimrehurek.com/gensim/models/ldamodel.html)** for $k=10$ topics, using default values for all other parameters.
Inspect the topics. Can you ‘label’ each topic with a short description of what it is about? Do the topics match your expectations?
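%% Cell type:markdown id: tags:
A sketch of a possible `train_lda_model` helper matching the calls used later in this notebook (`passes` is gensim's name for the number of training passes):
%% Cell type:code id: tags:solution
``` python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def train_lda_model(documents, n_topics, passes=1):
    """Train an LDA model with n_topics topics on the given token lists."""
    # Map each token to an integer id and build the bag-of-words corpus
    dictionary = Dictionary(documents)
    corpus = [dictionary.doc2bow(doc) for doc in documents]
    return LdaModel(corpus=corpus, id2word=dictionary,
                    num_topics=n_topics, passes=passes)

model = train_lda_model(documents, 10)
model.print_topics()
```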
%% Cell type:markdown id: tags:
## Problem 5: Monitor a topic model for convergence
%% Cell type:markdown id: tags:
When learning an LDA model, it is important to make sure that the training algorithm has converged to a stable posterior distribution. One way to do so is to plot, after each training epoch (or ‘pass’, in gensim parlance), the log likelihood of the training data under the posterior. Your last task in this lab is to create such a plot and, based on this, to suggest an appropriate number of epochs.
To collect information about the posterior likelihood after each pass, we need to enable the logging facilities of gensim. Once this is done, gensim will add various diagnostics to a log file `gensim.log`.
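%% Cell type:markdown id: tags:
A sketch of the logging setup and the `clear_logfile()`/`parse_logfile()` helpers used below. The regular expression assumes the “per-word bound” lines that gensim's LdaModel writes to the log; adjust the pattern if your gensim version logs differently:
%% Cell type:code id: tags:
``` python
import logging
import re

# Send gensim's diagnostic messages to gensim.log
logging.basicConfig(filename='gensim.log',
                    format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

def clear_logfile():
    """Empty the log file before training a new model."""
    open('gensim.log', 'w').close()

def parse_logfile():
    """Return the per-word log-likelihood bounds logged during training."""
    with open('gensim.log') as logfile:
        return [float(m.group(1)) for m in
                re.finditer(r'(-?\d+\.\d+) per-word bound', logfile.read())]
```
%% Cell type:markdown id: tags: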
Here's an example of how to run it — note that we call `clear_logfile()` to empty the logfile before training the model. If your code from Problem 4 was correct, the result should be a list with a single log-likelihood score, since we are doing a single training pass:
%% Cell type:code id: tags:
``` python
clear_logfile()
model = train_lda_model(documents, 10, passes=1)
likelihoods = parse_logfile()
print(likelihoods)
```
%% Output
[-6.896]
%% Cell type:markdown id: tags:
### Task 5.1: Plotting log-likelihoods
Your task now is to **re-train your LDA model for 50 passes**, retrieve the list of log likelihoods, and **create a plot** from this data.
%% Cell type:code id: tags:
``` python
clear_logfile()
model = train_lda_model(documents, 10, passes=50)
likelihoods = parse_logfile()
plt.plot(likelihoods)
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
### Task 5.2: Interpreting log-likelihoods
How do you interpret the plot you produced in Task 5.1? Based on the plot, what would be a reasonable choice for the number of passes? **Retrain your LDA model with that number** and re-inspect the topics it finds.
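%% Cell type:markdown id: tags:
For instance, if your curve flattens out early, you might retrain as follows (the value 30 is purely hypothetical; read a suitable number of passes off your own plot):
%% Cell type:code id: tags:
``` python
# passes=30 is a hypothetical choice based on where the curve flattens
model = train_lda_model(documents, 10, passes=30)
model.print_topics()
```
%% Cell type:markdown id: tags: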
<strong>After you have solved the lab,</strong> write a <em>brief</em> reflection (max. one A4 page) on the question(s) below. Remember:
<ul>
<li>You are encouraged to discuss this part with your lab partner, but you should each write up your reflection <strong>individually</strong>.</li>
<li><strong>Do not put your answers in the notebook</strong>; upload them in the separate submission opportunity for the reflections on Lisam.</li>
</ul>
</div>
%% Cell type:markdown id: tags:
1. In Problem 3, you performed an evaluation of $k$-means clustering with different values for $k$. How do you interpret the results? What would you expect to be a “good” number of clusters for this dataset? What do the evaluation measures suggest would be a “good” number of clusters?
2. How did you choose the number of LDA passes in Task 5.2? Do you consider the topic clusters you got in Task 5.2 to be “better” than the ones from Task 4.2? Base your reasoning on one or more concrete examples from the LDA output.
➡️ Before you submit, **make sure the notebook can be run from start to finish** without errors. For this, _restart the kernel_ and _run all cells_ from top to bottom. In Jupyter Notebook version 7 or higher, you can do this via "Run$\rightarrow$Restart Kernel and Run All Cells..." in the menu (or the "⏩" button in the toolbar).