➡️ Before you start, make sure that you are familiar with the **[study guide](https://liu-nlp.ai/text-mining/logistics/)**, in particular the rules around **cheating and plagiarism** (found in the course memo).

➡️ If you use code from external sources (e.g. StackOverflow, ChatGPT, ...) as part of your solutions, don't forget to add a reference to these source(s) (for example as a comment above your code).

➡️ Make sure you fill in all cells that say **`YOUR CODE HERE`** or **YOUR ANSWER HERE**. You normally shouldn't need to modify any of the other cells.

</div>

%% Cell type:markdown id: tags:

# L3: Information Extraction

%% Cell type:markdown id: tags:

Information extraction (IE) is the task of identifying named entities and semantic relations between these entities in text data. In this lab we will focus on two sub-tasks in IE, **named entity recognition** (identifying mentions of entities) and **entity linking** (matching these mentions to entities in a knowledge base).
%% Cell type:code id: tags:

``` python
# Define some helper functions that are used in this notebook
from IPython.display import display, HTML

def success():
    display(HTML('<div class="alert alert-success"><strong>Checks have passed!</strong></div>'))
```
%% Cell type:markdown id: tags:

## Dataset

%% Cell type:markdown id: tags:

The main data set for this lab is a collection of news wire articles in which mentions of **named entities** have been annotated with **page names** from the [English Wikipedia](https://en.wikipedia.org/wiki/). The next code cell loads the training and the development parts of the data into Pandas data frames.

%% Cell type:code id: tags:
``` python
import bz2
import csv

import pandas as pd
import numpy as np

with bz2.open('ner-train.tsv.bz2', mode='rt', encoding='utf-8') as source:
    # The exact loading code is assumed here; the column layout used in the
    # rest of the lab is sentence_id, sentence, start, end, label. The
    # development data (df_dev) is loaded in the same way from its own file.
    df_train = pd.read_csv(source, sep='\t', quoting=csv.QUOTE_NONE,
                           names=['sentence_id', 'sentence', 'start', 'end', 'label'])
```

%% Cell type:markdown id: tags:

## Problem 1: Evaluation measures

%% Cell type:markdown id: tags:

Throughout this lab, we will evaluate predictions by comparing a set of predicted items (`pred`) against a set of gold-standard items (`gold`). Your first task is to implement a function `evaluation_scores()` that takes these two sets and returns precision, recall, and F1 as a triple `(precision, recall, F1)`.

Note that for implementing this function, it doesn’t matter what exactly `gold` and `pred` will contain, except that they will be Python `set` objects.
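One possible implementation is sketched below; it computes the three measures directly from the set intersection (a sketch, assuming the standard definitions of precision, recall, and F1):

%% Cell type:code id: tags:

``` python
def evaluation_scores(gold, pred):
    """Compute precision, recall, and F1 for a set of predictions.

    Arguments:
        gold: The set of gold-standard items.
        pred: The set of predicted items.

    Returns:
        A triple (precision, recall, F1) of evaluation scores.
    """
    tp = len(gold & pred)    # items that are both predicted and in the gold standard
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```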
%% Cell type:markdown id: tags:

Let's also define a convenience function that prints the scores nicely:
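A minimal sketch of such a function, together with a small pair of example sets (the exact contents of `example_gold` and `example_pred` are an assumption; any two sets with three gold items, two predicted items, and one item in common give the scores checked below):

%% Cell type:code id: tags:

``` python
def print_evaluation_scores(scores):
    """Print a (precision, recall, F1) triple on a single line."""
    precision, recall, f1 = scores
    print(f'Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}')

# Small example sets for testing (assumed for illustration)
example_gold = {('001', 0, 2), ('001', 4, 5), ('002', 1, 3)}
example_pred = {('001', 0, 2), ('002', 2, 3)}
```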
%% Cell type:code id: tags:

``` python
result = evaluation_scores(example_gold, example_pred)
print_evaluation_scores(result)

# Check if the scores appear correct
assert np.isclose(result, (.5, 1. / 3, .4)).all(), "Should be close to the expected values"
success()
```

%% Output

Precision: 0.500, Recall: 0.333, F1: 0.400
%% Cell type:markdown id: tags:

## Problem 2: Named entity recognition

%% Cell type:markdown id: tags:

One of the first tasks that an information extraction system has to solve is to locate and classify (mentions of) named entities, such as persons and organizations, a task usually known as **named entity recognition (NER)**. For this lab, we will consider a slightly simplified version of NER, by only looking at the _spans_ of tokens containing an entity mention, without the actual entity label.
The English language models in spaCy feature a full-fledged [named entity recognizer](https://spacy.io/usage/linguistic-features#named-entities) that identifies a variety of entities and can be updated with new entity types by the user. We therefore start by loading spaCy. _However,_ the data that we will be using has already been tokenized (following the conventions of the [Penn Treebank](ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html)), so we need to prevent spaCy from re-tokenizing it. We do this by overriding spaCy’s tokenizer with a basic tokenizer that simply splits on whitespace:
%% Cell type:code id: tags:

``` python
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load('en_core_web_md')    # Let’s use the "medium" (md) model this time
nlp.tokenizer = Tokenizer(nlp.vocab)  # ...but override the tokenizer
```
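%% Cell type:markdown id: tags:

To see the effect of this override, we can run a pre-tokenized sentence through the pipeline and check that the tokens are exactly the whitespace-separated strings (the sentence below is just an illustrative example, not taken from the dataset):

%% Cell type:code id: tags:

``` python
# With the whitespace tokenizer, the input is split on spaces only, so the
# pre-tokenized text is preserved as-is
doc = nlp('Pierre Vinken , 61 years old , will join the board .')
print([token.text for token in doc])
```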
%% Cell type:markdown id: tags:

Your task in this problem is to **evaluate the performance of spaCy’s NER component** when predicting entity spans in the **development data**.

This can be done in the following three steps:

1. Write a function `gold_spans()` that takes a DataFrame and returns a set of triples of the form `(sentence_id, start_position, end_position)`, one for each entity mention _in the dataset_.
2. Write a function `pred_spans()` that takes a DataFrame, runs spaCy’s NER on each sentence, and returns a set of triples (in the same form as above), one for each entity mention _predicted by spaCy_.
3. Evaluate the results using your function from Problem 1.

We ask you to implement `gold_spans()` and `pred_spans()` as _generator functions_ that “yield” a single triple at a time, and provide stubs of such functions below that you can use as a starting point. (If you're not familiar with the `yield` keyword in Python, check out [this brief explanation](https://www.nbshare.io/notebook/851988260/Python-Yield/).)
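As a minimal illustration of this pattern (with made-up values), a generator function yields one item at a time, and the caller can collect all yielded items into a set:

%% Cell type:code id: tags:

``` python
# A toy generator (hypothetical values, just to illustrate the pattern)
def example_generator():
    yield ('0001-001', 0, 2)
    yield ('0001-001', 5, 6)

print(set(example_generator()))
```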
%% Cell type:code id: tags:

``` python
def gold_spans(df):
    """Yield the gold-standard mention spans in a data frame.

    Arguments:
        df: A data frame.

    Yields:
        The gold-standard mention spans in the specified data frame as
        triples consisting of the sentence id, start position, and end
        position of each span.
    """
    # Hint: The Pandas method .itertuples() is useful for iterating over rows in a DataFrame
    for row in df.itertuples():
        yield row[1], row[3], row[4]
```
%% Cell type:code id: tags:solution

``` python
def pred_spans(df):
    """Run spaCy's NER and yield the predicted mention spans.

    Arguments:
        df: A data frame.

    Yields:
        The predicted mention spans in the specified data frame as
        triples consisting of the sentence id, start position, and end
        position of each span.
    """
    for row in df.itertuples():
        sentence = row[2]
        doc = nlp(sentence)
        for ent in doc.ents:
            yield row[1], ent.start, ent.end
```
%% Cell type:markdown id: tags:

#### 🤞 Putting it all together

The following cell shows how you can put it all together and produce the evaluation report, provided you have implemented the functions as generator functions. You should get a precision above 50%, with a recall above 70%, and an F1-score above 60%.

%% Cell type:code id: tags:

``` python
# Collect the gold-standard and predicted spans on the development data and
# evaluate them (the setup lines here are an assumed reconstruction)
spans_gold = set(gold_spans(df_dev))
spans_pred = set(pred_spans(df_dev))
scores = evaluation_scores(spans_gold, spans_pred)
print_evaluation_scores(scores)

assert scores[0] > .50, "Precision should be above 50%."
assert scores[1] > .70, "Recall should be above 70%."
success()
```

%% Output
%% Cell type:markdown id: tags:

## Problem 3: Error analysis

%% Cell type:markdown id: tags:

As you can see in Problem 2, the span accuracy of the named entity recognizer is far from perfect. In particular, only slightly more than half of the predicted spans are correct according to the gold standard. Your next task is to analyse this result in more detail.

Below is a function that uses spaCy’s span visualizer to visualize sentences containing _at least one mistake_ (i.e., either a false positive, a false negative, or both):
%% Cell type:code id: tags:

``` python
from collections import defaultdict

from spacy import displacy
from spacy.tokens import Span

def error_report(df, spans_gold, spans_pred):
    """Yield sentences whose predicted entity spans differ from the gold standard.

    Arguments:
        df: A data frame.
        spans_gold: The set of gold-standard entity spans from the data frame.
        spans_pred: The set of predicted entity spans from the data frame.

    Yields:
        One spaCy Doc per sentence containing at least one mistake, with the
        gold-standard and predicted spans stored in the "sc" span group
        (labelled "GOLD" and "PRED", respectively) for rendering with displacy.
    """
    gold_by_sid = defaultdict(set)
    for (sentence_id, span_s, span_e) in spans_gold:
        gold_by_sid[sentence_id].add((span_s, span_e))
    pred_by_sid = defaultdict(set)
    for (sentence_id, span_s, span_e) in spans_pred:
        pred_by_sid[sentence_id].add((span_s, span_e))
    for row in df.drop_duplicates('sentence_id').itertuples():
        if gold_by_sid[row.sentence_id] == pred_by_sid[row.sentence_id]:
            continue
        doc = nlp(row.sentence)
        doc.spans["sc"] = [
            Span(doc, span_s, span_e, "GOLD") for (span_s, span_e) in gold_by_sid[row.sentence_id]
        ] + [
            Span(doc, span_s, span_e, "PRED") for (span_s, span_e) in pred_by_sid[row.sentence_id]
        ]
        yield doc
```
%% Cell type:markdown id: tags:

Let’s inspect a small sample of the training data in this way. The following cell renders sentences containing mistakes that the automated prediction makes on the _first 500 rows_ of the training data (you may have to click on “Show more outputs” at the bottom to see all of them):

%% Cell type:code id: tags:solution

``` python
df_inspect = df_train[:500]
spans_inspect_pred = set(pred_spans(df_inspect))
for doc in error_report(df_inspect, set(gold_spans(df_inspect)), spans_inspect_pred):
    # Render the gold and predicted spans with spaCy's span visualizer
    # (the body of this loop is an assumed reconstruction)
    displacy.render(doc, style="span")
```

%% Cell type:markdown id: tags:

Can you see any patterns in the mistakes from the sample above? **Write a short text** that summarizes your observations!

%% Cell type:markdown id: tags:

We can see four types of mistakes that our model consistently makes. It labels points in time such as "tomorrow" or "Monday", and it labels numbers such as "16.4 percent" or "two". It also includes the article "the" in front of names such as "the European ...". Finally, it misses non-English words.
%% Cell type:markdown id: tags:

### Task 3.2

Based on your insights from the error analysis, you should be able to improve the automated prediction that you implemented in Problem 2. While the best way to do this would be to [update spaCy’s NER model](https://spacy.io/usage/linguistic-features#updating) using domain-specific training data, for this lab it suffices to **write code to post-process the output** produced by spaCy. To filter out specific labels it is useful to know the named entity label scheme, which can be found in the [model's documentation](https://spacy.io/models/en#en_core_web_sm).
%% Cell type:code id: tags:solution

``` python
def pred_spans_improved(df):
    """Run and evaluate spaCy's NER, with post-processing to improve the results.

    Arguments:
        df: A data frame.

    Yields:
        The predicted mention spans in the specified data frame as
        triples consisting of the sentence id, start position, and end
        position of each span.
    """
    # One possible post-processing strategy, based on the error analysis above
    # (the exact rules below are an assumed reconstruction): skip entity types
    # such as dates and numbers, and strip a leading "the" from predicted spans.
    skip_labels = {'DATE', 'TIME', 'CARDINAL', 'ORDINAL', 'QUANTITY', 'PERCENT', 'MONEY'}
    for row in df.itertuples():
        doc = nlp(row[2])
        for ent in doc.ents:
            if ent.label_ in skip_labels:
                continue
            start, end = ent.start, ent.end
            if doc[start].lower_ == 'the' and end - start > 1:
                start += 1
            yield row[1], start, end
```

%% Cell type:code id: tags:

``` python
# Evaluate the improved predictions on the development data
# (these setup lines are an assumed reconstruction)
spans_pred_improved = set(pred_spans_improved(df_dev))
scores_improved = evaluation_scores(spans_gold, spans_pred_improved)
print_evaluation_scores(scores_improved)

assert scores_improved[-1] > .8, "F1-score should be above 0.8"
success()
```
%% Output
%% Cell type:markdown id: tags:

### Task 3.3

Before moving on, we ask you to **store the outputs of the improved named entity recognizer in a new data frame**. This new frame should have the same layout as the original data frame for the _development data_ that you loaded above, but should contain the *predicted* start and end positions for each token span, rather than the gold positions. As the `label` of each span, you can use the special value `--NME--` for now.

%% Cell type:code id: tags:

``` python
def df_with_pred_spans(df):
    """Make a new DataFrame with *predicted* NER spans.

    Arguments:
        df: A data frame.

    Returns:
        A *new* data frame with the same layout as `df`, but containing
        the predicted start and end positions for each token span.
    """
    # YOUR CODE HERE
    raise NotImplementedError()
```
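%% Cell type:markdown id: tags:

For reference, one possible way to build such a frame is sketched below (a sketch, not the only solution; it assumes the column layout `sentence_id`, `sentence`, `start`, `end`, `label` and reuses `pred_spans_improved()` from Task 3.2):

%% Cell type:code id: tags:

``` python
def df_with_pred_spans_sketch(df):
    # Look up each sentence once by its id, then emit one row per predicted span
    sentences = df.drop_duplicates('sentence_id').set_index('sentence_id')['sentence']
    rows = [(sid, sentences[sid], start, end, '--NME--')
            for sid, start, end in pred_spans_improved(df)]
    return pd.DataFrame(rows, columns=['sentence_id', 'sentence', 'start', 'end', 'label'])
```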
%% Cell type:markdown id: tags:

Run the following cell to run your function and display the first few lines of the new data frame:

%% Cell type:code id: tags:solution

``` python
df_dev_pred = df_with_pred_spans(df_dev)
display(df_dev_pred.head())
```

%% Output

%% Cell type:markdown id: tags:
## Problem 4: Entity linking

%% Cell type:markdown id: tags:

Now that we have a method for predicting mention spans, we turn to the task of **entity linking**, which amounts to predicting the knowledge base entity that is referenced by a given mention. In our case, for each span, we want to predict the Wikipedia page that this mention references.

%% Cell type:markdown id: tags:

### Task 4.1

Start by **extending the generator function** that you implemented in Problem 2 to **labelled spans**.
%% Cell type:code id: tags:

``` python
def gold_mentions(df):
    """Yield the gold-standard mentions in a data frame.

    Args:
        df: A data frame.

    Yields:
        The gold-standard mention spans in the specified data frame as
        quadruples consisting of the sentence id, start position, end
        position and entity label of each span.
    """
    for row in df.itertuples():
        yield row[1], row[3], row[4], row[5]
```
%% Cell type:markdown id: tags:

#### 🤞 Test your code

To test your code, you can run the following cell, which checks if one of the expected tuples is included in the results:

%% Cell type:code id: tags:solution

``` python
dev_gold_mentions = set(gold_mentions(df_dev))

assert ('1094-020', 0, 1, 'Seattle_Mariners') in dev_gold_mentions, "An expected tuple is not included in the results"
success()
```

%% Output
%% Cell type:markdown id: tags:

### Task 4.2

A naive baseline for entity linking on our data set is to link each mention span to the Wikipedia page name that we get when we join the tokens in the span by underscores, as is standard in Wikipedia page names. Suppose, for example, that a span contains the two tokens

    Jimi Hendrix

The baseline Wikipedia page name for this span would be

    Jimi_Hendrix

**Implement this naive baseline and evaluate its performance!**

%% Cell type:markdown id: tags:

**_Important:_** Here and in the remainder of this lab, you should base your experiments on the _predicted spans_ that you computed in Problem 3.
%% Cell type:code id: tags:solution

``` python
def baseline(df):
    """A naive baseline for entity linking that "predicts" Wikipedia
    page names from the tokens in the mention span.

    Arguments:
        df: A data frame.

    Yields:
        The predicted mention spans in the specified data frame as
        quadruples consisting of the sentence id, start position, end
        position and the predicted entity label of each span.
    """
    # Join the tokens in each span with underscores to form a page name
    # (this body is an assumed reconstruction of the baseline described above)
    for row in df.itertuples():
        tokens = row[2].split()[row[3]:row[4]]
        yield row[1], row[3], row[4], '_'.join(tokens)
```

%% Cell type:markdown id: tags:

Again, we can turn to the evaluation measures that we implemented in Problem 1. The expected precision should be around 29%, with an F1-score around 28%.

%% Cell type:code id: tags:

``` python
# Evaluate the baseline on the predicted spans from Problem 3
# (these setup lines are an assumed reconstruction)
mentions_gold = set(gold_mentions(df_dev))
mentions_pred = set(baseline(df_dev_pred))
scores = evaluation_scores(mentions_gold, mentions_pred)
print_evaluation_scores(scores)

assert scores[0] > .28, "Precision should be above 28%"
assert scores[-1] > .27, "F1-score should be above 27%"
success()
```
%% Output
Precision: 0.301, Recall: 0.274, F1: 0.287
%% Cell type:markdown id: tags:

## Problem 5: Extending the training data using the knowledge base

%% Cell type:markdown id: tags:

State-of-the-art approaches to entity linking exploit information in knowledge bases. In our case, where Wikipedia is the knowledge base, one particularly useful type of information is the set of links to other Wikipedia pages. In particular, we can interpret the anchor texts (the highlighted texts that you click on) as mentions of the entities (pages) that they link to. This allows us to harvest long lists of mention–entity pairings.

The following cell loads a data frame summarizing anchor texts and page references harvested from the first paragraphs of the English Wikipedia. The data frame also contains all entity mentions in the training data (but not the development or the test data).

%% Cell type:markdown id: tags:

To understand what information is available in this data, the following cell shows the entry for the anchor text `Sweden`.
%% Cell type:code id: tags:

``` python
df_kb.loc[df_kb.mention == 'Sweden']
```
%% Cell type:markdown id: tags:

As you can see, each row of the data frame contains a pair $(m, e)$ of a mention $m$ and an entity $e$, as well as the conditional probability $P(e|m)$ for mention $m$ referring to entity $e$. These probabilities were estimated based on the frequencies of mention–entity pairs in the knowledge base. The example shows that the anchor text ‘Sweden’ is most often used to refer to the entity [Sweden](http://en.wikipedia.org/wiki/Sweden), but in a few cases also to refer to Sweden’s national football and ice hockey teams. Note that references are sorted in decreasing order of probability, so that the most probable pairing comes first.
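Because of this ordering, the most probable entity for a given mention can be read off the first matching row. A minimal sketch (the name of the column holding the page name is an assumption here):

%% Cell type:code id: tags:

``` python
# Most probable entity for the mention 'Sweden' (assuming the page name is
# stored in a column called `entity`)
df_kb.loc[df_kb.mention == 'Sweden'].iloc[0].entity
```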
%% Cell type:markdown id: tags:

**Implement an entity linking method** that resolves each mention to the most probable entity in the data frame. If the mention is not included in the data frame, you can predict the generic label `--NME--`.
%% Cell type:code id: tags:solution

``` python
def most_probable_method(df, df_kb):
    """An entity linker that resolves each mention to the most probable entity in a knowledge base.

    Arguments:
        df: A data frame containing the mention spans.
        df_kb: A data frame containing the knowledge base.

    Yields:
        The predicted mention spans in the specified data frame as
        quadruples consisting of the sentence id, start position, end
        position and the predicted entity label of each span.
    """
    # YOUR CODE HERE
    raise NotImplementedError()
```
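%% Cell type:markdown id: tags:

One possible implementation is sketched below. It is only a sketch: it assumes that the rows of `df_kb` are sorted by decreasing probability (as noted above), that the page name is stored in a column called `entity`, and that mentions in the knowledge base are written with spaces between tokens:

%% Cell type:code id: tags:

``` python
def most_probable_method_sketch(df, df_kb):
    # Since the rows for each mention are sorted by decreasing probability,
    # the first row per mention is the most probable entity
    best = df_kb.drop_duplicates('mention').set_index('mention')
    for row in df.itertuples():
        mention = ' '.join(row[2].split()[row[3]:row[4]])
        if mention in best.index:
            yield row[1], row[3], row[4], best.loc[mention].entity
        else:
            yield row[1], row[3], row[4], '--NME--'
```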
%% Cell type:markdown id: tags:

### 🤞 Test your code

We run the same evaluation as before. The expected precision should now be around 65%, with an F1-score of around 59%.

%% Cell type:code id: tags:

``` python
# Evaluate the most-probable-entity linker on the predicted spans
# (these setup lines are an assumed reconstruction)
mentions_pred = set(most_probable_method(df_dev_pred, df_kb))
scores = evaluation_scores(mentions_gold, mentions_pred)
print_evaluation_scores(scores)

assert scores[0] > .64, "Precision should be above 64%"
assert scores[-1] > .58, "F1-score should be above 58%"
success()
```
%% Cell type:markdown id: tags:

## Problem 6: Context-sensitive disambiguation

%% Cell type:markdown id: tags:

Consider the entity mention ‘Lincoln’. The most probable entity for this mention turns out to be [Lincoln, Nebraska](http://en.wikipedia.org/Lincoln,_Nebraska); but in pages about American history, we would be better off predicting [Abraham Lincoln](http://en.wikipedia.org/Abraham_Lincoln). This suggests that we should try to disambiguate between different entity references based on the textual context on the page from which the mention was taken. Your task in this last problem is to implement this idea.

Set up a dictionary that contains, for each mention $m$ that can refer to more than one entity $e$, a separate Naive Bayes classifier that is trained to predict the correct entity $e$, given the textual context of the mention. As the prior probabilities of the classifier, choose the probabilities $P(e|m)$ that you used in Problem 5. To let you estimate the context-specific probabilities, we have compiled a data set with mention contexts:

%% Cell type:markdown id: tags:

This data frame contains, for each ambiguous mention $m$ and each knowledge base entity $e$ to which this mention can refer, up to 100 randomly selected contexts in which $m$ is used to refer to $e$. For this data, a **context** is defined as the 5 tokens to the left and the 5 tokens to the right of the mention. Here are a few examples:
%% Cell type:code id: tags:

``` python
df_contexts.head()
```

%% Cell type:markdown id: tags:

Note that, in each context, the position of the mention is indicated by the `@` symbol.

From this data frame, it is easy to select the data that you need to train the classifiers – the contexts and corresponding entities for all mentions. To illustrate this, the following cell shows how to select all contexts that belong to the mention ‘Lincoln’:
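(A sketch of this selection, assuming `df_contexts` has a `mention` column analogous to `df_kb`:)

%% Cell type:code id: tags:

``` python
# Select all contexts that belong to the mention 'Lincoln'
df_contexts[df_contexts.mention == 'Lincoln']
```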
%% Cell type:markdown id: tags:

Implement the context-sensitive disambiguation method and evaluate its performance. Do this in two parts, first implementing a function that builds the classifiers _(refer to the text above for a detailed description)_, then implementing a prediction function that uses these classifiers to perform the entity prediction.

Here are some more **hints** that may help you along the way:

1. The prior probabilities for a Naive Bayes classifier can be specified using the `class_prior` option. You will have to provide the probabilities in the same order as the alphabetically sorted class (entity) names.
2. Not all mentions in the knowledge base are ambiguous, and therefore not all mentions have context data. If a mention has only one possible entity, pick that one. If a mention has no entity at all, predict the `--NME--` label.
%% Cell type:code id: tags:solution

``` python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def build_entity_classifiers(df_kb, df_contexts):
    """Build Naive Bayes classifiers for entity prediction.

    Arguments:
        df_kb: A data frame with the knowledge base.
        df_contexts: A data frame with contexts for each mention.

    Returns:
        A dictionary where the keys are mentions and the values are Naive Bayes
        classifiers trained to predict the correct entity, given the textual
        context of the mention (as described in detail above).
    """
    # Assumed sketch of one possible implementation; the column names `entity`,
    # `probability` (df_kb) and `entity`, `context` (df_contexts) are assumptions.
    classifiers = {}
    for mention, group in df_contexts.groupby('mention'):
        # Prior probabilities P(e|m), ordered by the alphabetically sorted
        # entity names (cf. hint 1 above); a small value avoids zero priors
        entities = sorted(group.entity.unique())
        priors = (df_kb[df_kb.mention == mention].set_index('entity')['probability']
                  .reindex(entities).fillna(1e-6).to_numpy())
        clf = make_pipeline(CountVectorizer(), MultinomialNB(class_prior=priors))
        clf.fit(group.context, group.entity)
        classifiers[mention] = clf
    return classifiers
```

%% Cell type:markdown id: tags:

Finally, the cell below evaluates the results as before. You should expect to see a small (around 1 unit) increase in each of precision, recall, and F1.

%% Cell type:markdown id: tags:

<strong>After you have solved the lab,</strong> write a <em>brief</em> reflection (max. one A4 page) on the question(s) below. Remember:
<ul>
<li>You are encouraged to discuss this part with your lab partner, but you should each write up your reflection <strong>individually</strong>.</li>
<li><strong>Do not put your answers in the notebook</strong>; upload them in the separate submission opportunity for the reflections on Lisam.</li>
</ul>
</div>
%% Cell type:markdown id: tags:

1. In Problem 3, you performed an error analysis and implemented some post-processing to improve the model’s evaluation scores. How could you improve the model’s performance further, and what kind of resources (such as data, compute, etc.) would you need for that? Discuss this based on two or three concrete examples from the error analysis.
2. How does the “context” data from Problem 6 help to disambiguate between different entities? Can you think of other types of “context” that you could use for disambiguation? Illustrate this with a specific example.

➡️ Before you submit, **make sure the notebook can be run from start to finish** without errors. For this, _restart the kernel_ and _run all cells_ from top to bottom. In Jupyter Notebook version 7 or higher, you can do this via "Run$\rightarrow$Restart Kernel and Run All Cells..." in the menu (or the "⏩" button in the toolbar).