When we implemented softmax regression, we represented each sentence as a count vector. For the CBOW classifier we will use a different representation where each vector contains the word ids of the tokens in the sentence. We pad shorter sentences with zeroes.
As in the notebook on softmax regression, we first construct our vocabulary. Note that we reserve the word id 0 for padding.
%% Cell type:code id: tags:
``` python
def make_vocab(data):
    # Reserve the word id 0 for the padding marker.
    vocab = {'<pad>': 0}
    for sentence, label in data:
        for t in sentence:
            if t not in vocab:
                vocab[t] = len(vocab)
    return vocab
```
%% Cell type:markdown id: tags:
We create the vocabulary from the training data:
%% Cell type:code id: tags:
``` python
vocab = make_vocab(train_data)
```
%% Cell type:markdown id: tags:
Next we create our vector representation:
%% Cell type:code id: tags:
``` python
import torch
def vectorize(vocab, data):
    # All vectors will be right-padded up to the maximal length.
    max_length = max(len(s) for s, _ in data)
    xs = []
    ys = []
    for sentence, label in data:
        x = [0] * max_length
        for i, w in enumerate(sentence):
            # Words that are not in the vocabulary keep the padding id 0.
            if w in vocab:
                x[i] = vocab[w]
        xs.append(x)
        ys.append(label)
    return torch.LongTensor(xs), torch.LongTensor(ys)
```
%% Cell type:markdown id: tags:
We vectorize the training data and the development data:
%% Cell type:code id: tags:
``` python
train_x, train_y = vectorize(vocab, train_data)
dev_x, dev_y = vectorize(vocab, dev_data)
```
%% Cell type:markdown id: tags:
The resulting tensors are much smaller than the corresponding tensors from the notebook on softmax regression.
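As a quick sanity check, we can print the shapes of the new tensors (a minimal sketch; the exact sizes depend on the dataset):
%% Cell type:code id: tags:
``` python
# Inspect the shapes of the vectorized tensors; the concrete sizes depend on the data.
print(train_x.shape, train_y.shape)
print(dev_x.shape, dev_y.shape)
```
%% Cell type:markdown id: tags: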
Recall that our baseline (always predict the label 3) gives us an accuracy of slightly above 25%:
%% Cell type:code id: tags:
``` python
accuracy(torch.full_like(dev_y, 3), dev_y)
```
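%% Cell type:markdown id: tags:
Here, `accuracy` is assumed to be the evaluation helper used with softmax regression. A minimal sketch of what it computes:
%% Cell type:code id: tags:
``` python
# Sketch of the assumed accuracy helper: the proportion of positions where the
# predicted labels agree with the gold labels.
def accuracy(y_pred, y):
    return torch.mean(torch.eq(y_pred, y).float()).item()
```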
%% Cell type:markdown id: tags:
## Training the model
%% Cell type:markdown id: tags:
We are now ready to set up the CBOW model and train it using cross-entropy loss.
%% Cell type:code id: tags:
``` python
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
```
%% Cell type:markdown id: tags:
As with softmax regression, we will train our model using minibatch gradient descent.
%% Cell type:code id: tags:
``` python
def minibatches(x, y, batch_size):
    # Shuffle the examples and yield them in batches of the given size.
    random_indices = torch.randperm(x.size(0))
    for i in range(0, x.size(0) - batch_size + 1, batch_size):
        batch_indices = random_indices[i:i+batch_size]
        yield x[batch_indices], y[batch_indices]
```
%% Cell type:markdown id: tags:
### Model
Recall that a CBOW model consists of an embedding layer followed by an element-wise mean, a linear layer, and a final softmax. In PyTorch, embedding layers are implemented by the class [`nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html). We use the parameter `padding_idx` so that the embedding for the special padding marker `<pad>` is initialized to zeros and not updated during training.
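A minimal sketch of such a model (the class name and the hyperparameters are illustrative, not fixed by the text above):
%% Cell type:code id: tags:
``` python
class CBOW(nn.Module):
    # Sketch of a CBOW classifier: embedding -> mean over tokens -> linear layer.
    def __init__(self, num_words, embedding_dim, num_classes):
        super().__init__()
        # padding_idx=0 keeps the <pad> embedding at zero and excludes it from updates.
        self.embedding = nn.Embedding(num_words, embedding_dim, padding_idx=0)
        self.linear = nn.Linear(embedding_dim, num_classes)
    def forward(self, x):
        # x has shape (batch_size, max_length) and contains word ids.
        embedded = self.embedding(x)   # (batch_size, max_length, embedding_dim)
        pooled = embedded.mean(dim=1)  # mean over the token positions
        # Return logits; the softmax is folded into the cross-entropy loss.
        return self.linear(pooled)
```
%% Cell type:markdown id: tags: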
For training, we can reuse essentially the same training loop as for softmax regression. The only major change is the use of the Adam optimizer instead of SGD. The Adam optimizer will be our default choice in later units.
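A sketch of such a training loop (the function name, the embedding dimension, and the hyperparameter values are illustrative assumptions; the number of classes is 5, as in the fine-grained sentiment setting):
%% Cell type:code id: tags:
``` python
def train_cbow(n_epochs=10, batch_size=32, lr=1e-3):
    # Sketch of the training loop; hyperparameter values are illustrative.
    model = CBOW(len(vocab), 64, 5)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    for epoch in range(n_epochs):
        model.train()
        for bx, by in minibatches(train_x, train_y, batch_size):
            optimizer.zero_grad()
            loss = F.cross_entropy(model(bx), by)
            loss.backward()
            optimizer.step()
        # Evaluate on the development data after each epoch.
        model.eval()
        with torch.no_grad():
            dev_acc = accuracy(model(dev_x).argmax(dim=-1), dev_y)
        print(f'epoch {epoch + 1}, dev accuracy: {dev_acc:.4f}')
    return model
model = train_cbow()
```
%% Cell type:markdown id: tags: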
The accuracy of our CBOW classifier is higher than what [Socher et al., 2013](https://www.aclweb.org/anthology/D13-1170) reported. (The relevant point of comparison is the accuracy of their VecAvg method in the fine-grained setting, as reported in their Table 1.) However, the accuracy is still a bit *below* that of the classifier that we implemented in the notebook on softmax regression. On the plus side, the CBOW classifier uses much more compact vectors.
%% Cell type:markdown id: tags:
**🤔 Exploration 1: Add dropout**
> Looking at the loss curve on the development data, it seems that the model starts to overfit after only a few epochs. One way to counteract this is to regularise the model by inserting a [**dropout layer**](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html) between the mean and the final linear layer. Evaluate what effect this change has on the model accuracy. A standard dropout probability is 0.5.
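> A possible starting point (a sketch only; whether it helps is what the exploration asks you to measure):
%% Cell type:code id: tags:
``` python
class CBOWWithDropout(nn.Module):
    # Sketch: the CBOW model from above with a dropout layer inserted between
    # the mean pooling and the final linear layer.
    def __init__(self, num_words, embedding_dim, num_classes, p=0.5):
        super().__init__()
        self.embedding = nn.Embedding(num_words, embedding_dim, padding_idx=0)
        self.dropout = nn.Dropout(p)
        self.linear = nn.Linear(embedding_dim, num_classes)
    def forward(self, x):
        pooled = self.embedding(x).mean(dim=1)
        return self.linear(self.dropout(pooled))
```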
%% Cell type:markdown id: tags:
**🤔 Exploration 2: Different random seeds**
> One issue related to the evaluation of neural networks is that results can vary significantly across different random seeds. Because of this, and resources permitting, we should always train and evaluate the model with several random seeds. Produce a box plot for the accuracy of the CBOW model over 10 random seeds. How large is the variance?
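> A possible skeleton for this experiment (a sketch; it assumes the `train_cbow` function sketched above and uses matplotlib for the box plot):
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt
# Train and evaluate the model once per seed, then summarize the dev accuracies.
accuracies = []
for seed in range(10):
    torch.manual_seed(seed)
    model = train_cbow()
    model.eval()
    with torch.no_grad():
        accuracies.append(accuracy(model(dev_x).argmax(dim=-1), dev_y))
plt.boxplot(accuracies)
plt.ylabel('dev accuracy')
plt.show()
```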
%% Cell type:markdown id: tags:
**🤔 Exploration 3: Task-specific embeddings**
> As mentioned in Lecture 1.3, the word embeddings that are learned by neural networks are tuned towards the specific task that the network is trained on. Use the code provided in Lab L1 to save the embeddings learned by the CBOW classifier and explore them in the [Embedding Projector](https://projector.tensorflow.org). Look at the neighbourhoods of sentiment-laden words. Does the result match your expectations?
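> If you do not have the Lab L1 code at hand, the following sketch shows one way to export the embeddings in the tab-separated format that the Embedding Projector can load (the file names and the `model.embedding` attribute are assumptions based on the model sketched above):
%% Cell type:code id: tags:
``` python
# Write one vector per line to vectors.tsv and the corresponding word to metadata.tsv.
embeddings = model.embedding.weight.detach()
words = sorted(vocab, key=vocab.get)
with open('vectors.tsv', 'w') as f_vec, open('metadata.tsv', 'w') as f_meta:
    for word, vector in zip(words, embeddings):
        print('\t'.join(f'{x:.5f}' for x in vector.tolist()), file=f_vec)
        print(word, file=f_meta)
```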