Given a set of $N$ data points $(x_i, y_i)$, the goal is to find the best linear map $f: \mathbb{R} \to \mathbb{R}$, $f(x) = mx + b$, that fits the data points. In simpler terms, we assume the relation between the dependent variable $y$ and the independent variable $x$ is linear, and we look for the optimal $m$ and $b$ that minimize some error function.
Loss/Error Function
$$\begin{equation}
E = \frac{1}{N}\sum_{i = 1}^{N}\left(y_i - \widehat{y}_i\right)^2
\end{equation}$$

where $\widehat{y}_i = m x_i + b$ is the predicted value at $x_i$.
We arrive at the desired $m$ and $b$ by repeatedly updating these values in the direction of steepest descent of this function. The learning rate $L$ has to be specified.
$$\begin{equation}
\bar{m} = m - L\frac{\partial E}{\partial m}
\end{equation}$$

$$\begin{equation}
\bar{b} = b - L\frac{\partial E}{\partial b}
\end{equation}$$
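A minimal sketch of these updates in NumPy (the function name `fit_line` and the synthetic data are illustrative, not from the original; the gradients in the comments follow directly from the loss above):

```python
import numpy as np

def fit_line(x, y, L=0.01, epochs=5000):
    """Gradient descent for y ≈ m*x + b under the mean squared error E."""
    m, b = 0.0, 0.0
    N = len(x)
    for _ in range(epochs):
        y_hat = m * x + b
        # dE/dm = -(2/N) * sum(x_i * (y_i - y_hat_i))
        grad_m = -(2.0 / N) * np.sum(x * (y - y_hat))
        # dE/db = -(2/N) * sum(y_i - y_hat_i)
        grad_b = -(2.0 / N) * np.sum(y - y_hat)
        m, b = m - L * grad_m, b - L * grad_b
    return m, b

# Illustrative usage on noisy synthetic data with slope 3 and intercept 2
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + np.random.normal(0, 0.5, size=x.shape)
m, b = fit_line(x, y)   # m, b approach 3 and 2
```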
Visualizing the Loss Function $E(m, b)$
Visualizing the Gradient Descent every 5 epochs
Since $E(m, b)$ has only a single global minimum (a paraboloid!), setting an appropriate learning rate $L$ is highly important. Too large a rate will overshoot and skip past the minimum, and the iterates may or may not wobble back down into the well even if the number of epochs is increased.
Gradient Descent
For a multivariable function $F(\mathbf{v})$ in $N$ dimensions, say, which is differentiable at a point $\mathbf{v}$, $F$ decreases fastest in the direction of the negative of the gradient at that point, denoted by $-\nabla F(\mathbf{v})$.
Notice that the decrease in $F(\mathbf{v})$ is guaranteed only down to the nearest well, which may or may not be the global minimum. We may run for a specified number of epochs or terminate once the update falls below a set tolerance, as in the sketch below.
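A rough, generic sketch of such a loop (the function name `gradient_descent` and its parameters are made up for this example):

```python
import numpy as np

def gradient_descent(grad, v0, L=0.1, epochs=1000, tol=1e-8):
    """Iterate v <- v - L * grad(v) until the step is tiny or epochs run out."""
    v = np.asarray(v0, dtype=float)
    for _ in range(epochs):
        step = L * grad(v)
        v = v - step
        if np.linalg.norm(step) < tol:   # terminate at the set tolerance
            break
    return v
```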
2-dimensional example
Let
$$F(\mathbf{v}) = F\left(\begin{bmatrix} x \\\ y \end{bmatrix}\right) = \sin(x)\cos(y)$$
and starting from an initial point $\mathbf{v}_0$, we may reach the nearest local minimum as:
$$\begin{equation}
\begin{bmatrix} \bar x \\\ \bar y \end{bmatrix} = \begin{bmatrix} x \\\ y \end{bmatrix} - L \begin{bmatrix} \cos(x)\cos(y) \\\ -\sin(x)\sin(y) \end{bmatrix}
\end{equation}$$
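A short sketch of this explicit update (the starting point and learning rate are arbitrary choices for illustration):

```python
import numpy as np

# Gradient descent on F(x, y) = sin(x)*cos(y)
x, y, L = 1.0, 0.5, 0.1          # arbitrary starting point and learning rate
for _ in range(500):
    x, y = x - L * np.cos(x) * np.cos(y), y - L * (-np.sin(x) * np.sin(y))

print(np.sin(x) * np.cos(y))     # ≈ -1: a nearby local minimum, here (x, y) ≈ (-pi/2, 0)
```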
$F(x, y) = \sin(x)\cos(y)$
Visualizing the Descent
Convolution
For two functions $f(x)$ and $g(x)$, the convolution function $f(x) * g(x)$ is defined as:

$$\begin{equation}
(f * g)(x) = \int_{-\infty}^{\infty} f(\tau)\, g(x - \tau)\, d\tau
\end{equation}$$

or, for discrete signals, $(f * g)[n] = \sum_{m} f[m]\, g[n - m]$. If $f$ has $N$ samples and $g$ has $M$ samples, then the convolved function has $N + M - 1$ samples. A basic rule: "flip any one of the functions, overlap it with the stationary one, multiply and add, and then traverse over."
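A quick check of the sample-count rule (the arrays below are arbitrary examples):

```python
import numpy as np

f = np.array([1, 2, 3, 4])    # N = 4 samples
g = np.array([1, 0, -1])      # M = 3 samples

# Full discrete convolution: flip g, slide it over f, multiply and add
out = np.convolve(f, g, mode='full')
print(len(out))               # N + M - 1 = 6
```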
Also, before proceeding with the convolution, the kernel must be flipped left-right and then up-down:
$$ker = \begin{bmatrix}a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \longrightarrow \begin{bmatrix}c & b & a \\ f & e & d \\ i & h & g \end{bmatrix} \longrightarrow \begin{bmatrix}i & h & g \\ f & e & d \\ c & b & a \end{bmatrix} = ker'$$
This is achieved by `ker_flipped = np.flipud(np.fliplr(ker))`, with `fliplr` denoting a left-right flip and `flipud` denoting an up-down flip.
Choose a stride of length 1, zero-pad $A$ with a border of width one, and perform the convolution by taking the element-wise product of each kernel-sized chunk of the padded $A$ with $ker'$ and summing it up. For $A = \begin{bmatrix}1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$, this yields a $3 \times 3$ output of nine elements:
$$\begin{bmatrix}0 & 0 & 0 \\ 0 & 1 & 2 \\ 0 & 4 & 5 \end{bmatrix} \cdot \begin{bmatrix}i & h & g \\ f & e & d \\ c & b & a \end{bmatrix} = elt_1$$
$$\begin{bmatrix}0 & 0 & 0 \\ 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \cdot \begin{bmatrix}i & h & g \\ f & e & d \\ c & b & a \end{bmatrix} = elt_2$$
$$\vdots$$
$$\begin{bmatrix}5 & 6 & 0 \\ 8 & 9 & 0 \\ 0 & 0 & 0 \end{bmatrix} \cdot \begin{bmatrix}i & h & g \\ f & e & d \\ c & b & a \end{bmatrix} = elt_9$$
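A minimal sketch of this procedure in NumPy (the input $A$ matches the chunks above; the kernel values are an arbitrary example, and `scipy.signal.convolve2d` is used only as a cross-check):

```python
import numpy as np
from scipy.signal import convolve2d

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
ker = np.array([[1, 0, -1],
                [2, 0, -2],
                [1, 0, -1]])              # arbitrary example kernel

ker_flipped = np.flipud(np.fliplr(ker))   # flip left-right, then up-down
A_pad = np.pad(A, 1)                      # zero padding of width 1

# Slide the flipped kernel over A with stride 1; multiply element-wise and add
out = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        chunk = A_pad[i:i + 3, j:j + 3]
        out[i, j] = np.sum(chunk * ker_flipped)

# Cross-check against SciPy's zero-padded ('same') convolution
assert np.array_equal(out, convolve2d(A, ker, mode='same'))
```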