diff --git a/lab3/lab-notes.md b/lab3/lab-notes.md
index f521d14aea8202743d92a46b7ba2a5d32a39cb4e..babd8262e1495c7dbda1160f6a6ff414b53a00c2 100644
--- a/lab3/lab-notes.md
+++ b/lab3/lab-notes.md
@@ -1,39 +1,172 @@
-Assignment 1
+# Lab 3
 
-- What is the kernel trick?
-Since we can rewrite the šæ^2 regularised linear regression formula to a form where non-linear transformations š“(x) only appear via inner product, we do not have to design a š‘‘-dimensional vector š“(x) and derive its inner product. Instead, we can just choose a kernel
-šœ…(x, x') directly where the kernel is the inner product of two non-linear input transformations according to:
-šœ…(x, x') = š“(x)^Tš“(x').
-This is known as the kernel trick:
-If x enters the model as š“(x)^Tš“(x') only, we can choose a kernel šœ…(x, x') instead of chosing š“(x). p. 194
+## Assignment 1
 
-- In the literature, it is common to see a formulation of SVMs that makes use of a hyperparameter. What is the purpose of this hyperparameter?
-The hyperparameter C is the regularization term in the dual formulation of SVMs:
-\[
-\alpha = \arg \min_\alpha \left( \frac{1}{2} \alpha^T K(X, X) \alpha - \alpha^T y \right)
-\]
-\[
-\text{subject to } \lvert \alpha_i \rvert \leq \frac{1}{2n\lambda} \quad \text{and} \quad 0 \leq \alpha_i y
-\]
-with \[y(x^\star) = \operatorname{sign} \left( b + \alpha^T K(X, x^\star) \right)\].
-Here \[C = \frac{1}{2n\lambda}\]. p. 211
+- _What is the kernel trick?_
 
+  Since we can rewrite $L^2$-regularised linear regression into a form where the non-linear transformations $\phi(x)$ only appear via inner products, we do not have to design a $d$-dimensional feature vector $\phi(x)$ and derive its inner product explicitly. Instead, we can choose a kernel
+  $K(x, x')$ directly, where the kernel is the inner product of two non-linearly transformed inputs:
+  $K(x, x') = \phi(x)^T\phi(x')$.
+  This is known as the kernel trick:
+  if $x$ enters the model as $\phi(x)^T\phi(x')$ only, we can choose a kernel $K(x, x')$ instead of choosing $\phi(x)$. [p. 194]
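+
+  As a small worked illustration (a standard textbook example, not taken from the lab itself): for $x \in \mathbb{R}^2$, the feature map $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)^T$ gives
+
+  $$
+  \phi(x)^T\phi(x') = x_1^2 x_1'^2 + 2 x_1 x_2 x_1' x_2' + x_2^2 x_2'^2 = (x^T x')^2,
+  $$
+
+  so the kernel $K(x, x') = (x^T x')^2$ can be evaluated directly without ever forming $\phi(x)$.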
 
-- In neural networks, what do we mean by mini-batch and epoch?
-We call a small subsample of data a mini-batch, which typically can contain š‘›š‘ = 10, š‘›š‘ = 100, or š‘›š‘ = 1 000
-data points. One complete pass through the training data is called an epoch, and consequently consists of š‘›/š‘›š‘ iterations. p. 125
+- _In the literature, it is common to see a formulation of SVMs that makes use of a hyperparameter. What is the purpose of this hyperparameter?_
 
+  The hyperparameter $C$ sets the amount of regularisation; it appears as the bound on the dual variables in the dual formulation of SVMs:
 
+  $$
+  \hat{\alpha} = \arg \min_\alpha \frac{1}{2} \alpha^T \mathbf{K(X, X)}\alpha - \alpha^T \mathbf{y}
+  $$
 
-Assignment 4
+  $$
+  \text{subject to } | \alpha_i | \leq \frac{1}{2n\lambda} \text{ and } 0 \leq \alpha_i y_i
+  $$
 
-4.1
-Results look good. reed curve is almost the same as blue. 10 hidden units seem to be quite suffiecient. Some off points between 5 and 7.
+  with
 
-4.2
-h1 gives a very bad predictions of e learned NN on the test data.
+  $$
+  \hat{y}(\mathbf{x_\star}) = \operatorname{sign} \left(\hat{\alpha}^T \mathbf{K(X, x_\star)} \right).
+  $$
 
-h2: The ReLU function does not have defined derivative when max(0,x) is used. Instead ifelse(x>0,x,0) is used. The prediction is quite good for Var < 4 but then off.
+  In this case
 
-h3: Good predctions for all Var, but not as good as sigmoid as activation function.
+  $$
+  C = \frac{1}{2n\lambda}.
+  $$
 
+  [p. 211]
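+
+  In practice $C$ is supplied directly as the cost parameter when fitting the SVM. A minimal hedged sketch with the `kernlab` package (synthetic data and illustrative names, not the lab's code):
+
+  ```r
+  library(kernlab)
+
+  set.seed(1)
+  x <- matrix(rnorm(200), ncol = 2)
+  y <- factor(ifelse(x[, 1] + x[, 2] > 0, "pos", "neg"))
+
+  # C is the cost of constraint violations: a large C tolerates few margin
+  # violations (weak regularisation), a small C tolerates many (strong regularisation).
+  model <- ksvm(x, y, kernel = "rbfdot", kpar = list(sigma = 0.05), C = 1)
+  ```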
+
+- _In neural networks, what do we mean by mini-batch and epoch?_
+
+  We call a small subsample of data a mini-batch, which typically can contain $n_b = 10$, $n_b = 100$, or $n_b = 1000$
+  data points. One complete pass through the training data is called an epoch, and consequently consists of $n/n_b$ iterations. [p. 125]
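+
+  A minimal sketch of the bookkeeping in plain R (illustrative numbers, not lab code):
+
+  ```r
+  set.seed(1)
+  n   <- 1000                                    # number of training data points
+  n_b <- 100                                     # mini-batch size
+
+  idx     <- sample(n)                           # shuffle the data once per epoch
+  batches <- split(idx, ceiling(seq_along(idx) / n_b))
+
+  length(batches)                                # n / n_b = 10 iterations per epoch
+  ```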
+
+## Assignment 2
+
+We use a Gaussian kernel given by
+
+$$
+k(x,x') = \exp\left(-\frac{\|x-x'\|^2}{2h^2}\right)
+$$
+
+where $\|\cdot\|$ is the Euclidean norm and $h$ is the kernel width.
+
+Gaussian kernels for different features:
+
+- Physical distance ($h = 100000$)
+
+  ![assignment2-physical-distance-kernel](./figures/assignment2-physical-distance-kernel.png)
+
+- Date ($h = 15$)
+
+  ![assignment2-date-kernel](./figures/assignment2-date-kernel.png)
+
+- Time ($h = 4$)
+
+  ![assignment2-time-kernel](./figures/assignment2-time-kernel.png)
+
+For the given coordinates, date and time, the distance to each data point was calculated. Kernels were then constructed as $\mathbf{K}(\mathbf{X}, x_\star)$, yielding a vector of kernel values for each feature.
+
+These kernels were then combined in two separate ways: first by summing the kernels, and then by multiplying them together.
+
+For each method the resulting kernels were used to predict the temperature with the following formula:
+
+$$
+t_{\text{predicted}} = \frac{\sum_i k_i t_i}{\sum_i k_i}
+$$
+
+where $k_i$ is the total kernel value for each data point and $t_i$ is the measured air temperature for each data point.
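+
+A minimal R sketch of the procedure, with synthetic stand-ins for the station data (the distances and temperatures below are illustrative assumptions, not the lab's data):
+
+```r
+set.seed(1)
+n <- 50
+dist_phys <- runif(n, 0, 300000)               # physical distance to the point of interest (m)
+dist_date <- sample(0:182, n, replace = TRUE)  # distance in days
+dist_time <- sample(0:12, n, replace = TRUE)   # distance in hours
+temp      <- rnorm(n, mean = 5, sd = 8)        # measured air temperatures
+
+gauss <- function(d, h) exp(-d^2 / (2 * h^2))  # the Gaussian kernel above
+
+k_phys <- gauss(dist_phys, h = 100000)
+k_date <- gauss(dist_date, h = 15)
+k_time <- gauss(dist_time, h = 4)
+
+k_sum  <- k_phys + k_date + k_time             # additive combination
+k_prod <- k_phys * k_date * k_time             # multiplicative combination
+
+t_pred_sum  <- sum(k_sum  * temp) / sum(k_sum)
+t_pred_prod <- sum(k_prod * temp) / sum(k_prod)
+```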
+
+The values $\texttt{latitude = 58.2357}$, $\texttt{longitude = 15.3437}$ and $\texttt{date = "1974-12-18"}$ yielded the following results:
+
+- Added kernels
+
+  ![assignment2-added-kernels](./figures/assignment2-added-kernels.png)
+
+- Multiplied kernels
+
+  ![assignment2-multiplied-kernels](./figures/assignment2-multiplied-kernels.png)
+
+A major reason why the results differ is how each approach weighs closeness in the individual features. When the kernels are added, a single feature can make a data point influential on its own: if, say, the kernel value for physical distance is high but the value for date is low, the data point is still considered significant (e.g. $0.9 + 0.01 = 0.91$). If multiplication is used instead, the low date value drags the total value down ($0.9 \cdot 0.01 = 0.009$). In summary, with the multiplication method a data point needs to be close in all features to be significant in total.
+
+## Assignment 3
+
+Errors for each filter:
+
+- $E_0 = 0.1650$ (training data, validation error)
+- $E_1 = 0.1673$ (training data, test error)
+- $E_2 = 0.1498$ (training and validation data, test error)
+- $E_3 = 0.01373$ (all data, test error)
+
+1. The hyperparameter $C$ is chosen for a model trained on the training data by considering the validation error. The chosen parameter is thus the best possible for reducing the expected new-data error given that training data. Both $\texttt{filter0}$ and $\texttt{filter1}$ are constructed in the same way with the same training data, and they yield different errors only because they are evaluated on different data. $\texttt{filter2}$ is trained on both the training and validation data with the original hyperparameter; this reduces the expected new-data error, since more data is used. $\texttt{filter3}$ is trained on the entire dataset, for which we get a low test error. This is to be expected since the test data is used to train the model, so the error estimate is overly optimistic.
+
+   The model to be returned to the user is the one with the smallest expected generalization error, which is $\texttt{filter2}$. One could argue that this model risks overfitting since the same data used to select the hyperparameter is also used for training. However, since completely new data is used to estimate performance, a lower error does not mean that the model is too complex. Instead, using more data results in a more robust model that better represents unseen data. We can expect $\texttt{filter2}$ to generalize best to unseen data.
+
+2. The estimate of the generalization error of the filter returned to the user is then $E_2$ ($\texttt{err2}$). It reflects the performance of a model trained on more data (both the training and validation sets) and evaluated on unseen data.
+
+3. Once the support vector machine (SVM) has been fitted to data, the prediction for a new point can be computed as a linear combination of kernel values between the support vectors and the new point. This can be expressed as
+
+   $$
+   \hat{y}(\mathbf{x_\star}) = \text{sign}\left(b + \sum_j \hat{\alpha}_j K(\mathbf{x}_j, \mathbf{x_\star})\right)
+   $$
+
+   where $\mathbf{x_\star}$ is the new point, $\mathbf{x}_j$ is the $j$:th support vector, $b$ a bias term, $\hat{\alpha}_j$ the $j$:th dual coefficient and $K(\cdot,\cdot)$ a kernel function.
+
+   For the first ten data points we get the following linear combinations ($\hat{y}(\mathbf{x_\star})$ without the $\text{sign}$):
+
+   | i   | linear combination |
+   | --- | ----------------- |
+   | 1   | -1.070297         |
+   | 2   | 1.000345          |
+   | 3   | 0.9995908         |
+   | 4   | -0.9999648        |
+   | 5   | -0.9995379        |
+   | 6   | 1.000061          |
+   | 7   | -0.8585873        |
+   | 8   | -0.9997047        |
+   | 9   | 0.9998209         |
+   | 10  | -1.000097         |
+
+   This is the same result as obtained with the built-in prediction function for the SVM; a sketch of the computation is given below.
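+
+   A hedged sketch of how these decision values can be reconstructed from a fitted `kernlab` model (synthetic data and illustrative names, not the lab's exact code; `scaled = FALSE` keeps the manual kernel computation on the same scale as the fitted model):
+
+   ```r
+   library(kernlab)
+
+   set.seed(1)
+   x <- matrix(rnorm(200), ncol = 2)
+   y <- factor(ifelse(rowSums(x) > 0, 1, -1))
+
+   width <- 0.05
+   svm_model <- ksvm(x, y, kernel = "rbfdot", kpar = list(sigma = width),
+                     C = 1, scaled = FALSE)
+
+   sv    <- x[alphaindex(svm_model)[[1]], , drop = FALSE]  # support vectors x_j
+   co    <- coef(svm_model)[[1]]                           # signed dual coefficients alpha_j
+   inter <- b(svm_model)                                   # kernlab stores the negative intercept
+   rbf   <- rbfdot(sigma = width)
+
+   # b + sum_j alpha_j K(x_j, x_star) for the first ten points (subtracting the negative intercept)
+   K_new  <- kernelMatrix(rbf, x[1:10, , drop = FALSE], sv)
+   manual <- as.vector(K_new %*% co) - inter
+
+   # Should agree with the built-in prediction
+   data.frame(manual = manual,
+              built_in = as.vector(predict(svm_model, x[1:10, ], type = "decision")))
+   ```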
+
+## Assignment 4
+
+1. Logistic activation function $h(x) = \frac{1}{1 + e^{-x}}$. Good results: the predicted curve follows the test curve closely. A few points are off in the range $x \in [5,7]$.
+
+   ![assignment4-1](./figures/assignment4-1.png)
+
+2. Different activation functions (a sketch of how they can be defined in R follows the list):
+
+   - Linear: $h_1(x)=x$. Poor performance; the prediction does not follow the test data at all. Since the activation function is linear, the network cannot produce nonlinear predictions.
+
+     ![assignment4-2-1](./figures/assignment4-2-1.png)
+
+   - ReLU: $h_2(x) = \max\{0,x\}$. Decent performance for $x \in [0.5,4]$, but not outside that range.
+
+     ![assignment4-2-2](./figures/assignment4-2-2.png)
+
+   - Softplus: $h_3(x) = \ln(1 + \exp x)$. Very good performance, better than with the logistic activation.
+
+     ![assignment4-2-3](./figures/assignment4-2-3.png)
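+
+   A hedged sketch of how these activation functions can be defined and passed to the `neuralnet` package via its `act.fct` argument (the data, formula and hidden-layer size are illustrative assumptions; ReLU is written with `ifelse()` rather than `max()` so that the package can differentiate it):
+
+   ```r
+   library(neuralnet)
+
+   set.seed(1234567890)
+   x  <- runif(500, 0, 10)
+   tr <- data.frame(Var = x, Sin = sin(x))
+
+   h1 <- function(x) x                     # linear
+   h2 <- function(x) ifelse(x > 0, x, 0)   # ReLU
+   h3 <- function(x) log(1 + exp(x))       # softplus
+
+   # act.fct defaults to the logistic function; here the softplus variant is fitted
+   nn_h3 <- neuralnet(Sin ~ Var, data = tr, hidden = 10, act.fct = h3)
+   ```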
+
+3. The model performs well for $x \leq 10$, but beyond that the prediction starts diverging from the observed test points. This is because the model was fitted only on $x$ values up to $10$, so it does not generalize beyond the training range.
+
+   ![assignment4-3](./figures/assignment4-3.png)
+
+4. For larger $x$ the model output converges towards a constant value. This is due to two things. Firstly, the weights of the trained model are rather large, meaning that for large $x$ the absolute values of the inputs to the activation functions will be large. Secondly, the logistic activation function saturates: for large inputs we get
+
+   $$
+   \lim_{x \to \infty} h(x) = 1
+   $$
+
+   and
+
+   $$
+   \lim_{x \to -\infty} h(x) = 0
+   $$
+
+   i.e. the activation function itself saturates for large inputs. The convergence of the neural network output is a result of the hidden units' activations converging towards $0$ or $1$.
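+
+   This saturation is easy to check numerically; base R's `plogis` is exactly the logistic function $1/(1+e^{-x})$:
+
+   ```r
+   # The logistic function at increasingly large inputs: it flattens out at 0 and 1
+   plogis(c(-50, -5, 0, 5, 50))
+   # approx. 1.9e-22  0.0067  0.5000  0.9933  1.0000
+   ```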
+
+5. Performance is decent until the same value of $\sin(x)$ appears again. There is a many-to-one mapping from $x$ to $\sin(x)$: for example, $\sin(0) = \sin(2\pi) = 0$. To predict $x$ from $\sin(x)$ one would have to produce infinitely many possible values of $x$ for each $\sin(x)$, which is not possible.
+
+   ![assignment4-5](./figures/assignment4-5.png)