When we implemented softmax regression, we represented each sentence as a count vector. For the CBOW classifier we will use a different representation where each vector contains the word ids of the tokens in the sentence. We pad shorter sentences with zeroes.
As in the notebook on softmax regression, we first construct our vocabulary. Note that we reserve the word id 0 for padding.
%% Cell type:code id: tags:
``` python
def make_vocab(data):
    vocab = {'<pad>': 0}
    for sentence, label in data:
        for t in sentence:
            if t not in vocab:
                vocab[t] = len(vocab)
    return vocab
```
%% Cell type:markdown id: tags:
We create the vocabulary from the training data:
%% Cell type:code id: tags:
``` python
vocab = make_vocab(train_data)
```
%% Cell type:markdown id: tags:
Next we create our vector representation:
%% Cell type:code id: tags:
``` python
import torch

def vectorize(vocab, data):
    # All vectors will be right-padded up to the maximal length.
    max_length = max(len(s) for s, _ in data)
    xs = []
    ys = []
    for sentence, label in data:
        x = [0] * max_length
        for i, w in enumerate(sentence):
            if w in vocab:
                x[i] = vocab[w]
        xs.append(x)
        ys.append(label)
    return torch.LongTensor(xs), torch.LongTensor(ys)
```
%% Cell type:markdown id: tags:
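
As a quick check of the padding behaviour, the following sketch applies the two functions above to a made-up toy data set. The sentences and the names `toy_data`, `toy_vocab`, `toy_x` and `unseen_x` are illustrative assumptions, not part of the original data; the functions are repeated so that the block runs on its own.

```python
import torch

def make_vocab(data):
    vocab = {'<pad>': 0}
    for sentence, label in data:
        for t in sentence:
            if t not in vocab:
                vocab[t] = len(vocab)
    return vocab

def vectorize(vocab, data):
    max_length = max(len(s) for s, _ in data)
    xs, ys = [], []
    for sentence, label in data:
        x = [0] * max_length
        for i, w in enumerate(sentence):
            if w in vocab:
                x[i] = vocab[w]
        xs.append(x)
        ys.append(label)
    return torch.LongTensor(xs), torch.LongTensor(ys)

# Hypothetical toy data: (tokens, label) pairs.
toy_data = [(['a', 'good', 'film'], 3), (['bad'], 1)]
toy_vocab = make_vocab(toy_data)
toy_x, toy_y = vectorize(toy_vocab, toy_data)

# A word that is not in the vocabulary keeps the id 0,
# so it is treated exactly like padding.
unseen_x, _ = vectorize(toy_vocab, [(['a', 'great', 'film'], 4)])

print(toy_x)
print(unseen_x)
```

Note that the shorter sentence is right-padded with zeroes, and that unknown words are silently mapped to the same id as `<pad>`.
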
We vectorize the training data and the development data:
%% Cell type:code id: tags:
``` python
train_x, train_y = vectorize(vocab, train_data)
dev_x, dev_y = vectorize(vocab, dev_data)
```
%% Cell type:markdown id: tags:
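The resulting tensors are much smaller than the corresponding tensors from the notebook on softmax regression: each row holds one word id per token position, up to the length of the longest sentence, rather than one count per vocabulary item.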
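
To make the size difference concrete, here is a small sketch that builds both representations for the same made-up sentences and compares their shapes. The sentences and the names `toy_sentences`, `counts` and `ids` are illustrative assumptions, not the notebook's data.

```python
import torch

# Hypothetical toy sentences; the real data has a far larger vocabulary.
toy_sentences = [['a', 'good', 'film'], ['a', 'bad', 'bad', 'film'], ['fine']]

vocab = {'<pad>': 0}
for s in toy_sentences:
    for t in s:
        vocab.setdefault(t, len(vocab))

max_length = max(len(s) for s in toy_sentences)

# Count vectors: one column per vocabulary item.
counts = torch.zeros(len(toy_sentences), len(vocab), dtype=torch.long)
# Id vectors: one column per token position, right-padded with 0.
ids = torch.zeros(len(toy_sentences), max_length, dtype=torch.long)

for row, s in enumerate(toy_sentences):
    for col, t in enumerate(s):
        counts[row, vocab[t]] += 1
        ids[row, col] = vocab[t]

print(counts.shape, ids.shape)
```
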
Recall that our baseline (always predict the label 3) gives us an accuracy of slightly above 25%:
%% Cell type:code id: tags:
``` python
accuracy(torch.full_like(dev_y, 3), dev_y)
```
%% Output
0.25340598821640015
%% Cell type:markdown id: tags:
## Training the model
%% Cell type:markdown id: tags:
We are now ready to set up the CBOW model and train it using cross-entropy loss.
%% Cell type:code id: tags:
``` python
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
```
%% Cell type:markdown id: tags:
As with softmax regression, we will train our model using minibatch gradient descent.
%% Cell type:code id: tags:
``` python
def minibatches(x, y, batch_size):
    random_indices = torch.randperm(x.size(0))
    for i in range(0, x.size(0) - batch_size + 1, batch_size):
        batch_indices = random_indices[i:i+batch_size]
        yield x[batch_indices], y[batch_indices]
```
%% Cell type:markdown id: tags:
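
One detail of `minibatches` is worth spelling out: the range stops at `x.size(0) - batch_size + 1`, so a final batch with fewer than `batch_size` examples is skipped in each epoch. Because the indices are reshuffled on every call, different examples are left out each time. The sketch below checks this on made-up tensors (`toy_x`, `toy_y` and `batch_sizes` are illustrative names).

```python
import torch

def minibatches(x, y, batch_size):
    random_indices = torch.randperm(x.size(0))
    for i in range(0, x.size(0) - batch_size + 1, batch_size):
        batch_indices = random_indices[i:i+batch_size]
        yield x[batch_indices], y[batch_indices]

# Ten hypothetical examples with four token positions each.
toy_x = torch.arange(40).view(10, 4)
toy_y = torch.arange(10)

# With batch_size=4, only two full batches fit into ten examples.
batch_sizes = [bx.size(0) for bx, by in minibatches(toy_x, toy_y, 4)]
print(batch_sizes)
```
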
### Model
Recall that a CBOW model consists of an embedding layer followed by an element-wise mean, a linear layer, and a final softmax. In PyTorch, embedding layers are implemented by the class [`nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html). We use the parameter `padding_idx` so that the embedding for the special padding marker `<pad>` is initialized to the zero vector and is not updated during training.
> When writing PyTorch code, it is often very useful to annotate tensor-valued variables with their shapes. What is the shape of the variable `x` when it is passed as an argument to `forward`? What is the shape of the output of each of the operations in the `forward` method?
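
As a minimal sketch of the architecture described above, a shape-annotated module might look as follows. The class name `CBOW` and the constructor arguments `num_embeddings`, `embedding_dim` and `num_classes` are assumptions made for illustration. The sketch returns unnormalized scores (logits), on the assumption that the final softmax is folded into the cross-entropy loss.

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    # Hypothetical sketch; the argument names are assumptions.
    def __init__(self, num_embeddings, embedding_dim, num_classes):
        super().__init__()
        # padding_idx=0 keeps the <pad> embedding at the zero vector.
        self.embedding = nn.Embedding(num_embeddings, embedding_dim, padding_idx=0)
        self.linear = nn.Linear(embedding_dim, num_classes)

    def forward(self, x):
        # x: [batch_size, max_length]
        embedded = self.embedding(x)      # [batch_size, max_length, embedding_dim]
        averaged = embedded.mean(dim=1)   # [batch_size, embedding_dim]
        return self.linear(averaged)      # [batch_size, num_classes]

model = CBOW(num_embeddings=10, embedding_dim=8, num_classes=5)
x = torch.tensor([[1, 2, 3, 0], [4, 0, 0, 0]])
logits = model(x)
print(logits.shape)
```

In this sketch the mean is taken over all `max_length` positions, so padded positions contribute zero vectors to the average.
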
%% Cell type:markdown id: tags:
### Training loop
For training, we can reuse essentially the same training loop as for softmax regression. The only major change is the use of the Adam optimizer instead of SGD. The Adam optimizer will be our default choice in later units.
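
The following is a minimal sketch of such a loop, not the notebook's exact code. The class `CBOW`, the function name `train`, the hyperparameter values and the random toy data are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class CBOW(nn.Module):  # hypothetical stand-in for the model
    def __init__(self, num_embeddings, embedding_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim, padding_idx=0)
        self.linear = nn.Linear(embedding_dim, num_classes)

    def forward(self, x):
        return self.linear(self.embedding(x).mean(dim=1))

def minibatches(x, y, batch_size):
    random_indices = torch.randperm(x.size(0))
    for i in range(0, x.size(0) - batch_size + 1, batch_size):
        batch_indices = random_indices[i:i+batch_size]
        yield x[batch_indices], y[batch_indices]

def train(model, train_x, train_y, n_epochs=5, batch_size=8, lr=1e-2):
    # Adam replaces SGD; the rest follows the usual pattern.
    optimizer = optim.Adam(model.parameters(), lr=lr)
    epoch_losses = []
    for epoch in range(n_epochs):
        model.train()
        total_loss, n_batches = 0.0, 0
        for bx, by in minibatches(train_x, train_y, batch_size):
            optimizer.zero_grad()
            # cross_entropy applies the softmax internally, so we pass logits.
            loss = F.cross_entropy(model(bx), by)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            n_batches += 1
        epoch_losses.append(total_loss / n_batches)
    return epoch_losses

# Random toy data: 32 "sentences" of 6 word ids each, 5 classes.
torch.manual_seed(0)
toy_x = torch.randint(1, 20, (32, 6))
toy_y = torch.randint(0, 5, (32,))
model = CBOW(num_embeddings=20, embedding_dim=16, num_classes=5)
losses = train(model, toy_x, toy_y)
print(losses)
```
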
The accuracy of our CBOW classifier is higher than what [Socher et al., 2013](https://www.aclweb.org/anthology/D13-1170) reported. (The relevant point of comparison is the accuracy of their VecAvg method in the fine-grained setting, as reported in their Table 1.) However, the accuracy is still a bit *below* that of the classifier that we implemented in the notebook on softmax regression. On the plus side, the CBOW classifier uses much more compact vectors.
%% Cell type:markdown id: tags:
**🤔 Exploration 1: Dropout**
> Looking at the loss curve on the development data, it seems that the model starts to overfit after only a few epochs. One way to counteract this is to regularise the model by inserting a [**dropout layer**](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html) between the mean and the final linear layer. Evaluate what effect this change has on the model accuracy. A standard dropout probability is 0.5.
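
As a starting point, the sketch below shows one place where such a layer could go. The class name `CBOWWithDropout` and its constructor arguments are hypothetical; only `nn.Dropout` itself is taken from PyTorch.

```python
import torch
import torch.nn as nn

class CBOWWithDropout(nn.Module):  # hypothetical name
    def __init__(self, num_embeddings, embedding_dim, num_classes, p=0.5):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim, padding_idx=0)
        self.dropout = nn.Dropout(p)
        self.linear = nn.Linear(embedding_dim, num_classes)

    def forward(self, x):
        averaged = self.embedding(x).mean(dim=1)
        # Dropout sits between the mean and the final linear layer.
        return self.linear(self.dropout(averaged))

model = CBOWWithDropout(num_embeddings=10, embedding_dim=8, num_classes=5)
x = torch.tensor([[1, 2, 3, 0]])

model.eval()  # dropout is disabled in evaluation mode
out_a, out_b = model(x), model(x)
print(torch.equal(out_a, out_b))
```

Dropout is only active in training mode, so it matters to call `model.train()` before each training epoch and `model.eval()` before computing the development accuracy.
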
%% Cell type:markdown id: tags:
**🤔 Exploration 2: Early stopping**
> Another way to deal with overfitting is to stop the training early when the model accuracy no longer increases on the development (validation) data. Implement this strategy.
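
One common variant of this strategy uses a *patience* counter: training stops once the development accuracy has failed to improve for a fixed number of consecutive epochs. The sketch below isolates that decision logic. The function name `early_stopping_epoch`, the `patience` parameter and the accuracy values are all made up for illustration.

```python
def early_stopping_epoch(dev_accuracies, patience=2):
    """Return the index of the epoch at which training would stop."""
    best_acc = float('-inf')
    epochs_without_improvement = 0
    for epoch, acc in enumerate(dev_accuracies):
        if acc > best_acc:
            # New best accuracy: reset the counter.
            best_acc = acc
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # The stopping criterion never triggered.
    return len(dev_accuracies) - 1

# Made-up accuracy values, for illustration only.
history = [0.30, 0.34, 0.36, 0.35, 0.36, 0.33]
stop_epoch = early_stopping_epoch(history, patience=2)
print(stop_epoch)
```

In an actual training loop, the accuracy would be computed on the development data after each epoch, and a copy of the best parameters could be kept, for example via `model.state_dict()`.
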
%% Cell type:markdown id: tags:
**🤔 Exploration 3: Different random seeds**
> One issue related to the evaluation of neural networks is that results can vary significantly across different random seeds. Because of this, and resources permitting, we should always train and evaluate the model with several random seeds. Produce a box plot for the accuracy of the CBOW model over 10 random seeds. How large is the variance?
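
The sketch below illustrates how a seed controls the random initialization. The function `run` is a hypothetical stand-in for "train and evaluate the model with this seed"; here it only creates a small layer and returns its initial weights.

```python
import torch
import torch.nn as nn

def run(seed):
    # Hypothetical stand-in for a full training run.
    torch.manual_seed(seed)   # fixes PyTorch's global random number generator
    layer = nn.Linear(4, 2)   # the initial weights depend on the seed
    return layer.weight.detach().clone()

same_seed_equal = torch.equal(run(0), run(0))
other_seed_equal = torch.equal(run(0), run(1))
print(same_seed_equal, other_seed_equal)
```

In the exploration, `run` would instead train the CBOW model and return its development accuracy. Since `torch.randperm` also draws from the global generator, the seed affects the order of the minibatches as well. The ten resulting accuracies could then be passed to a plotting function such as matplotlib's `plt.boxplot`.
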
%% Cell type:markdown id: tags:
**🤔 Exploration 4: Task-specific embeddings**
> As mentioned in Lecture 1.3, the word embeddings that are learned by neural networks are tuned towards the specific task that the network is trained on. Use the code provided in Lab L1 to save the embeddings learned by the CBOW classifier and explore them in the [Embedding Projector](https://projector.tensorflow.org). Look at the neighbourhoods of sentiment-laden words. Does the result match your expectations?
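
The code from Lab L1 is not reproduced here. As an assumption about what such code might do, the sketch below writes the embedding vectors and the corresponding words as tab-separated text, a format that the Embedding Projector can load. The helper `write_embeddings` and the toy vocabulary are hypothetical.

```python
import io
import torch
import torch.nn as nn

def write_embeddings(vocab, embedding, vectors_file, metadata_file):
    # Hypothetical helper: one tab-separated vector and one word per line,
    # written in the same order so that the lines correspond.
    weights = embedding.weight.detach()
    for word, idx in vocab.items():
        values = weights[idx].tolist()
        vectors_file.write('\t'.join(f'{v:.6f}' for v in values) + '\n')
        metadata_file.write(word + '\n')

toy_vocab = {'<pad>': 0, 'good': 1, 'bad': 2}
toy_embedding = nn.Embedding(len(toy_vocab), 4, padding_idx=0)

# In-memory buffers stand in for files opened with open(..., 'w').
vectors, metadata = io.StringIO(), io.StringIO()
write_embeddings(toy_vocab, toy_embedding, vectors, metadata)
print(metadata.getvalue())
```
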